diff --git a/.github/workflows/gh-pages.yml b/.github/workflows/gh-pages.yml
index 17234b6390..e9de057c2c 100644
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -15,7 +15,7 @@ jobs:
- uses: actions/setup-python@v5
with:
python-version: 3.x
- - run: pip install mkdocs-material mkdocs-get-deps mkdocs-material-extensions mkdocs-multilang
+ - run: pip install mkdocs-material mkdocs-get-deps mkdocs-material-extensions mkdocs-multilang mkdocs-static-i18n
- name: Deploy to GitHub Pages
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
diff --git a/README.md b/README.md
index 8ddb61add2..0c20629ffc 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,4 @@
+English | [简体中文](README_CN.md)
@@ -22,11 +23,10 @@
--------------------------------------------------------------------------------
-# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
+# FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
## News
-
-**[2025-07] 《FastDeploy2.0推理部署实测》专题活动已上线!** 完成文心4.5系列开源模型的推理部署等任务,即可获得骨瓷马克杯等FastDeploy2.0官方周边及丰富奖金!🎁 欢迎大家体验反馈~ 📌[报名地址](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[活动详情](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
+**[2025-08] 🔥 Released FastDeploy v2.1:** Introduced a brand-new KV Cache scheduling strategy, expanded support for PD disaggregation and CUDA Graph to more models, and enhanced hardware support for platforms such as Kunlunxin and Hygon, along with comprehensive optimizations to the performance of both the serving layer and the inference engine.
**[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live!** Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌[Sign up here](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
@@ -50,14 +50,15 @@
## Installation
-FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, and other hardware. For detailed installation instructions:
+FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, **Iluvatar GPUs**, **Enflame GCUs**, **Hygon DCUs**, and other hardware. For detailed installation instructions:
- [NVIDIA GPU](./docs/get_started/installation/nvidia_gpu.md)
- [Kunlunxin XPU](./docs/get_started/installation/kunlunxin_xpu.md)
- [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md)
- [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
+- [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
-**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU, Hygon DCU, and MetaX GPU are currently under development and testing. Stay tuned for updates!
+**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
## Get Started
@@ -68,18 +69,19 @@ Learn how to use FastDeploy through our documentation:
- [Offline Inference Development](./docs/offline_inference.md)
- [Online Service Deployment](./docs/online_serving/README.md)
- [Full Supported Models List](./docs/supported_models.md)
+- [Best Practices](./docs/best_practices/README.md)
## Supported Models
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅(WINT4)| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|✅(WINT4)| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅| 128K |
+|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅|128K |
+|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅| 128K |
## Advanced Usage
diff --git a/README_CN.md b/README_CN.md
new file mode 100644
index 0000000000..6cebc527a2
--- /dev/null
+++ b/README_CN.md
@@ -0,0 +1,94 @@
+[English](README.md) | 简体中文
+
+安装指导 | 快速入门 | 支持模型列表
+
+--------------------------------------------------------------------------------
+# FastDeploy:基于飞桨的大语言模型与视觉语言模型推理部署工具包
+
+## 最新活动
+**[2025-08] 🔥 FastDeploy v2.1 全新发布:** 全新的KV Cache调度策略,更多模型支持PD分离和CUDA Graph,昆仑芯、海光等更多硬件支持增强,全方位优化服务和推理引擎的性能。
+
+**[2025-07] 《FastDeploy2.0推理部署实测》专题活动已上线!** 完成文心4.5系列开源模型的推理部署等任务,即可获得骨瓷马克杯等FastDeploy2.0官方周边及丰富奖金!🎁 欢迎大家体验反馈~ 📌[报名地址](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[活动详情](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
+
+## 关于
+
+**FastDeploy** 是基于飞桨(PaddlePaddle)的大语言模型(LLM)与视觉语言模型(VLM)推理部署工具包,提供**开箱即用的生产级部署方案**,核心技术特性包括:
+
+- 🚀 **负载均衡式PD分解**:工业级解决方案,支持上下文缓存与动态实例角色切换,在保障SLO达标和吞吐量的同时优化资源利用率
+- 🔄 **统一KV缓存传输**:轻量级高性能传输库,支持智能NVLink/RDMA选择
+- 🤝 **OpenAI API服务与vLLM兼容**:单命令部署,兼容[vLLM](https://github.com/vllm-project/vllm/)接口
+- 🧮 **全量化格式支持**:W8A16、W8A8、W4A16、W4A8、W2A16、FP8等
+- ⏩ **高级加速技术**:推测解码、多令牌预测(MTP)及分块预填充
+- 🖥️ **多硬件支持**:NVIDIA GPU、昆仑芯XPU、海光DCU、昇腾NPU、天数智芯GPU、燧原GCU、沐曦GPU等
+
+## 要求
+
+- 操作系统: Linux
+- Python: 3.10 ~ 3.12
+
+## 安装
+
+FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU**、**天数(Iluvatar)GPU**、**燧原(Enflame)GCU**、**海光(Hygon)DCU** 以及其他硬件上进行推理部署。详细安装说明如下:
+
+- [英伟达 GPU](./docs/zh/get_started/installation/nvidia_gpu.md)
+- [昆仑芯 XPU](./docs/zh/get_started/installation/kunlunxin_xpu.md)
+- [天数 CoreX](./docs/zh/get_started/installation/iluvatar_gpu.md)
+- [燧原 S60](./docs/zh/get_started/installation/Enflame_gcu.md)
+- [海光 DCU](./docs/zh/get_started/installation/hygon_dcu.md)
+
+**注意:** 我们正在积极拓展硬件支持范围。目前,包括昇腾(Ascend)NPU 和 沐曦(MetaX)GPU 在内的其他硬件平台正在开发测试中。敬请关注更新!
+
+## 入门指南
+
+通过我们的文档了解如何使用 FastDeploy:
+- [10分钟快速部署](./docs/zh/get_started/quick_start.md)
+- [ERNIE-4.5 部署](./docs/zh/get_started/ernie-4.5.md)
+- [ERNIE-4.5-VL 部署](./docs/zh/get_started/ernie-4.5-vl.md)
+- [离线推理](./docs/zh/offline_inference.md)
+- [在线服务](./docs/zh/online_serving/README.md)
+- [模型支持列表](./docs/zh/supported_models.md)
+- [最佳实践](./docs/zh/best_practices/README.md)
+
+## 支持模型列表
+
+| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
+|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K |
+|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
+|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
+|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅|128K |
+|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅| 128K |
+
+## 进阶用法
+
+- [量化](./docs/zh/quantization/README.md)
+- [分离式部署](./docs/zh/features/disaggregated.md)
+- [投机解码](./docs/zh/features/speculative_decoding.md)
+- [前缀缓存](./docs/zh/features/prefix_caching.md)
+- [分块预填充](./docs/zh/features/chunked_prefill.md)
+
+## 致谢
+
+FastDeploy 依据 [Apache-2.0 开源许可证](./LICENSE) 进行授权。在开发过程中,我们参考并借鉴了 [vLLM](https://github.com/vllm-project/vllm) 的部分代码,以保持接口兼容性,在此表示衷心感谢。
diff --git a/dockerfiles/Dockerfile.gpu b/dockerfiles/Dockerfile.gpu
index 9e1d97834e..6a31156ff1 100644
--- a/dockerfiles/Dockerfile.gpu
+++ b/dockerfiles/Dockerfile.gpu
@@ -1,4 +1,4 @@
-FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
+FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.0
ARG PADDLE_VERSION=3.1.1
ARG FD_VERSION=2.1.0
diff --git a/docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
similarity index 99%
rename from docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md
rename to docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
index 7c08369653..333cd38437 100644
--- a/docs/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -2,7 +2,8 @@
## Environmental Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following hardware for each quantization is as follows:
-| | WINT8 | WINT4 | FP8 |
+
+| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
diff --git a/docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
similarity index 97%
rename from docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md
rename to docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index ff7b9102ed..6c8fb2d5d2 100644
--- a/docs/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -2,7 +2,8 @@
## Environmental Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the following hardware for each quantization is as follows:
-| | WINT8 | WINT4 | FP8 |
+
+| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
|A800 80GB| 1 | 1 | / |
@@ -110,7 +111,6 @@ export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -120,7 +120,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
- --gpu-memory-utilization 0.9 \
+ --gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill" \
@@ -131,7 +131,6 @@ export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -140,7 +139,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
- --gpu-memory-utilization 0.85 \
+ --gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \
diff --git a/docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
similarity index 98%
rename from docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md
rename to docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index 7ae51bf75a..6a8b5af7a1 100644
--- a/docs/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -2,7 +2,8 @@
## Environmental Preparation
### 1.1 Hardware requirements
The minimum number of GPUs required to deploy `ERNIE-4.5-300B-A47B` on the following hardware for each quantization is as follows:
-| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 |
+
+| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 |
|-----|-----|-----|-----|-----|-----|
|H800 80GB| 8 | 4 | 8 | 2 | 4 |
|A800 80GB| 8 | 4 | / | 2 | 4 |
@@ -98,7 +99,6 @@ export FD_SAMPLING_CLASS=rejection
**How to enable:** Take the deployment of a single machine with 8 GPUs and 1P1D (4 GPUs each) as an example. Compared with the default hybrid deployment method, `--splitwise-role` is required to specify the role of each node, and the GPUs and logs of the two nodes are isolated through the environment variables `FD_LOG_DIR` and `CUDA_VISIBLE_DEVICES`.
```
export FD_LOG_DIR="log_prefill"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -111,7 +111,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
```
```
export FD_LOG_DIR="log_decode"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
# Note that innode-prefill-ports is specified as the Prefill service's engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
diff --git a/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
new file mode 100644
index 0000000000..3fc933fb2d
--- /dev/null
+++ b/docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,134 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+## 1. Environment Preparation
+### 1.1 Support Status
+
+The minimum number of cards required for deployment on the following hardware is as follows:
+
+| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| A30 [24G] | 2 | 2 | 4 |
+| L20 [48G] | 1 | 1 | 2 |
+| H20 [144G] | 1 | 1 | 1 |
+| A100 [80G] | 1 | 1 | 1 |
+| H800 [80G] | 1 | 1 | 1 |
+
+### 1.2 Install FastDeploy
+
+For the installation process, refer to the [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md) documentation.
+
+> ⚠️ Precautions:
+> - FastDeploy only supports models in Paddle format – please make sure to download models whose names end with the `-Paddle` suffix.
+> - Specifying the model name triggers an automatic download. If the model has already been downloaded, you can instead pass the absolute path to its download location.
+
+## 2. How to Use
+### 2.1 Basic: Launching the Service
+**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 1 \
+ --max-model-len 32768 \
+ --max-num-seqs 256 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 384 \
+ --quantization wint4 \
+ --enable-mm
+```
+**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 2 \
+ --max-model-len 131072 \
+ --max-num-seqs 256 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 384 \
+ --quantization wint4 \
+ --enable-mm
+```
+
+> ⚠️ For versions 2.1 and above, the new scheduler needs to be enabled via an environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or return empty results.
+
+The examples above are configurations that run stably while delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
+### 2.2 Advanced: How to Achieve Better Performance
+
+#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
+> **Context Length**
+- **Parameters:** `--max-model-len`
+- **Description:** Controls the maximum context length that the model can process.
+- **Recommendation:** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
+
+ ⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
+> **Maximum sequence count**
+- **Parameters:** `--max-num-seqs`
+- **Description:** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
+- **Recommendation:** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
+
+> **Multi-image and multi-video input**
+- **Parameters:** `--limit-mm-per-prompt`
+- **Description:** Our model supports multi-image and multi-video input in a single prompt. Please use this parameter to limit the number of images/videos per request, ensuring efficient resource utilization.
+- **Recommendation:** We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.
+
+> **Available GPU memory ratio during initialization**
+- **Parameters:** `--gpu-memory-utilization`
+- **Description:** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
+- **Recommendation:** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
+
+#### 2.2.2 Chunked Prefill
+- **Parameters:** `--enable-chunked-prefill`
+- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+- **Other relevant configurations**:
+
+  `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 384.
+
+#### 2.2.3 **Quantization precision**
+- **Parameters:** `--quantization`
+
+- **Supported precision types:**
+ - WINT4 (Suitable for most users)
+ - WINT8
+ - BFLOAT16 (When the `--quantization` parameter is not set, BFLOAT16 is used by default.)
+
+- **Recommendation:**
+ - Unless you have extremely stringent precision requirements, we strongly recommend using WINT4 quantization. This will significantly reduce memory consumption and increase throughput.
+ - If slightly higher precision is required, you may try WINT8.
+ - Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
+
+#### 2.2.4 **Adjustable environment variables**
+> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
+- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
+- **Recommendation:** This is a relatively aggressive optimization strategy that may affect results, and we are still comprehensively validating its impact. If you have high performance requirements and can accept potential compromises in output quality, you may consider enabling this strategy.
+
+> **Attention hyperparameter:** `FLAGS_max_partition_size=1024`
+- **Description:** This hyperparameter of the Append Attention (default) backend has been tested on commonly used datasets; our results show that setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
+- **Recommendation:** In the future this will be replaced by an automatic adjustment mechanism. If you have high performance requirements, you may consider setting it.
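+
+A minimal sketch (not a tuned configuration) combining both variables with the launch command from Example 1; whether these settings help depends on your workload:
+
+```shell
+# Optional performance tunables from 2.2.4; validate their effect on your own workload
+export FD_SAMPLING_CLASS=rejection       # faster sampling, may affect output quality
+export FLAGS_max_partition_size=1024     # Append Attention tuning for long-text decoding
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+  --port 8180 \
+  --metrics-port 8181 \
+  --engine-worker-queue-port 8182 \
+  --tensor-parallel-size 1 \
+  --max-model-len 32768 \
+  --max-num-seqs 256 \
+  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+  --reasoning-parser ernie-45-vl \
+  --gpu-memory-utilization 0.9 \
+  --enable-chunked-prefill \
+  --max-num-batched-tokens 384 \
+  --quantization wint4 \
+  --enable-mm
+```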
+
+## 3. FAQ
+**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the launch command.
+
+### 3.1 Out of Memory
+If the service prompts "Out of Memory" during startup, please try the following solutions:
+1. Ensure no other processes are occupying GPU memory;
+2. Use WINT4/WINT8 quantization and enable chunked prefill;
+3. Reduce context length and maximum sequence count as needed;
+4. Increase the number of GPU cards for deployment (e.g., 2 or 4 cards) by modifying the parameter `--tensor-parallel-size 2` or `--tensor-parallel-size 4`.
+
+If the service starts normally but later reports insufficient memory, try:
+1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`;
+2. Increase the number of deployment cards (parameter adjustment as above).
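+
+As an illustration only, a relaunch sketch applying the suggestions above together (the values are assumptions; adapt them to your hardware):
+
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+  --port 8180 \
+  --metrics-port 8181 \
+  --engine-worker-queue-port 8182 \
+  --tensor-parallel-size 2 \
+  --max-model-len 32768 \
+  --max-num-seqs 64 \
+  --quantization wint4 \
+  --gpu-memory-utilization 0.8 \
+  --enable-chunked-prefill \
+  --max-num-batched-tokens 384 \
+  --enable-mm
+```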
diff --git a/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
new file mode 100644
index 0000000000..2741a417ea
--- /dev/null
+++ b/docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -0,0 +1,110 @@
+
+# ERNIE-4.5-VL-424B-A47B-Paddle
+
+## 1. Environment Preparation
+### 1.1 Support Status
+The minimum number of cards required for deployment on the following hardware is as follows:
+
+| Device [GPU Mem] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| H20 [144G] | 8 | 8 | 8 |
+| A100 [80G] | 8 | 8 | - |
+| H800 [80G] | 8 | 8 | - |
+
+### 1.2 Install FastDeploy
+
+For the installation process, refer to the [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md) documentation.
+
+> ⚠️ Precautions:
+> - FastDeploy only supports models in Paddle format – please make sure to download models whose names end with the `-Paddle` suffix.
+> - Specifying the model name triggers an automatic download. If the model has already been downloaded, you can instead pass the absolute path to its download location.
+
+## 2. How to Use
+### 2.1 Basic: Launching the Service
+**Example 1:** Deploying a 128K context service on 8x H800 GPUs.
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 8 \
+ --max-model-len 131072 \
+ --max-num-seqs 16 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.8 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 384 \
+ --quantization wint4 \
+ --enable-mm
+```
+
+> ⚠️ For versions 2.1 and above, the new scheduler needs to be enabled via an environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or return empty results.
+
+The example above is a configuration that runs stably while delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
+### 2.2 Advanced: How to Achieve Better Performance
+
+#### 2.2.1 Evaluating Application Scenarios and Setting Parameters Correctly
+> **Context Length**
+- **Parameters:** `--max-model-len`
+- **Description:** Controls the maximum context length that the model can process.
+- **Recommendation:** Longer context lengths may reduce throughput. Adjust based on actual needs, with a maximum supported context length of **128k** (131,072).
+
+ ⚠️ Note: Longer context lengths will significantly increase GPU memory requirements. Ensure your hardware resources are sufficient before setting a longer context.
+> **Maximum sequence count**
+- **Parameters:** `--max-num-seqs`
+- **Description:** Controls the maximum number of sequences the service can handle, supporting a range of 1 to 256.
+- **Recommendation:** If you are unsure of the average number of sequences per request in your actual application scenario, we recommend setting it to **256**. If the average number of sequences per request in your application is significantly fewer than 256, we suggest setting it to a slightly higher value than the average to further reduce GPU memory usage and optimize service performance.
+
+> **Multi-image and multi-video input**
+- **Parameters:** `--limit-mm-per-prompt`
+- **Description:** Our model supports multi-image and multi-video input in a single prompt. Please use this parameter to limit the number of images/videos per request, ensuring efficient resource utilization.
+- **Recommendation:** We recommend setting the number of images and videos in a single prompt to **100 each** to balance performance and memory usage.
+
+> **Available GPU memory ratio during initialization**
+- **Parameters:** `--gpu-memory-utilization`
+- **Description:** Controls the available GPU memory for FastDeploy service initialization. The default value is 0.9, meaning 10% of the memory is reserved for backup.
+- **Recommendation:** It is recommended to use the default value of 0.9. If an "out of memory" error occurs during stress testing, you may attempt to reduce this value.
+
+#### 2.2.2 Chunked Prefill
+- **Parameters:** `--enable-chunked-prefill`
+- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
+- **Other relevant configurations**:
+
+  `--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended setting is 384.
+
+#### 2.2.3 **Quantization precision**
+- **Parameters:** `--quantization`
+
+- **Supported precision types:**
+ - wint4 (Suitable for most users)
+ - wint8
+ - bfloat16 (When the `--quantization` parameter is not set, bfloat16 is used by default.)
+
+- **Recommendation:**
+ - Unless you have extremely stringent precision requirements, we strongly recommend using wint4 quantization. This will significantly reduce memory consumption and increase throughput.
+ - If slightly higher precision is required, you may try wint8.
+ - Only consider using bfloat16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
+
+#### 2.2.4 **Adjustable environment variables**
+> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
+- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
+- **Recommendation:** This is a relatively aggressive optimization strategy that may affect results, and we are still comprehensively validating its impact. If you have high performance requirements and can accept potential compromises in output quality, you may consider enabling this strategy.
+
+> **Attention hyperparameter:** `FLAGS_max_partition_size=1024`
+- **Description:** This hyperparameter of the Append Attention (default) backend has been tested on commonly used datasets; our results show that setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
+- **Recommendation:** In the future this will be replaced by an automatic adjustment mechanism. If you have high performance requirements, you may consider setting it.
+
+## 3. FAQ
+**Note:** Deploying a multimodal service requires adding the `--enable-mm` parameter to the launch command.
+
+### 3.1 Out of Memory
+If the service prompts "Out of Memory" during startup, please try the following solutions:
+1. Ensure no other processes are occupying GPU memory;
+2. Use wint4/wint8 quantization and enable chunked prefill;
+3. Reduce context length and maximum sequence count as needed.
+
+If the service starts normally but later reports insufficient memory, try:
+1. Adjust the initial GPU memory utilization ratio by modifying `--gpu-memory-utilization`.
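+
+As an illustration only, a relaunch sketch applying the suggestions above (wint4 plus chunked prefill, a shorter context, and a lower initial memory utilization; the values are assumptions):
+
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+  --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+  --port 8180 \
+  --metrics-port 8181 \
+  --engine-worker-queue-port 8182 \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --max-num-seqs 16 \
+  --quantization wint4 \
+  --gpu-memory-utilization 0.7 \
+  --enable-chunked-prefill \
+  --max-num-batched-tokens 384 \
+  --enable-mm
+```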
diff --git a/docs/optimal_deployment/FAQ.md b/docs/best_practices/FAQ.md
similarity index 100%
rename from docs/optimal_deployment/FAQ.md
rename to docs/best_practices/FAQ.md
diff --git a/docs/features/plugins.md b/docs/features/plugins.md
new file mode 100644
index 0000000000..0fe97ef7b6
--- /dev/null
+++ b/docs/features/plugins.md
@@ -0,0 +1,99 @@
+# FastDeploy Plugin Mechanism Documentation
+
+FastDeploy supports a plugin mechanism that allows users to extend functionality without modifying the core code. Plugins are automatically discovered and loaded through Python's `entry_points` mechanism.
+
+## How Plugins Work
+
+Plugins are essentially registration functions that are automatically called when FastDeploy starts. The system uses the `load_plugins_by_group` function to ensure that all processes (including child processes in distributed training scenarios) have loaded the required plugins before normal operation begins.
+
+## Plugin Discovery Mechanism
+
+FastDeploy uses Python's `entry_points` mechanism to discover and load plugins. Developers need to register their plugins in the specified entry point group in their project.
+
+### Example: Creating a Plugin
+
+#### 1. Define the Plugin Registration Functions
+
+Assuming you have a custom model class `MyModelForCasualLM` and a pretrained class `MyPretrainedModel`, you can write the following registration function:
+
+```python
+# File: fd_add_dummy_model/__init__.py or fd_add_dummy_model/register.py
+from fastdeploy.model_registry import ModelRegistry
+from my_custom_model import MyModelForCasualLM, MyPretrainedModel
+from fastdeploy.config import ErnieArchitectures
+
+def register():
+ if "MyModelForCasualLM" not in ModelRegistry.get_supported_archs():
+ if MyModelForCasualLM.name().startswith("Ernie"):
+ ErnieArchitectures.register_ernie_model_arch(MyModelForCasualLM)
+ ModelRegistry.register_model_class(MyModelForCasualLM)
+ ModelRegistry.register_pretrained_model(MyPretrainedModel)
+```
+Assuming you have a custom model_runner class `MyModelRunner`, you can expose it through the following entry function:
+```python
+# File: fd_add_dummy_model_runner/__init__.py
+from .my_model_runner import MyModelRunner
+
+def get_runner():
+ return MyModelRunner
+```
+
+#### 2. Register Plugin in `setup.py`
+
+```python
+# setup.py
+from setuptools import setup
+
+setup(
+ name="fastdeploy-plugins",
+ version="0.1",
+ packages=["fd_add_dummy_model", "fd_add_dummy_model_runner"],
+ entry_points={
+ "fastdeploy.model_register_plugins": [
+ "fd_add_dummy_model = fd_add_dummy_model:register",
+ ],
+ "fastdeploy.model_runner_plugins": [
+            "model_runner = fd_add_dummy_model_runner:get_runner"
+ ],
+ },
+)
+```
+
+## Plugin Structure
+
+Plugins consist of three components:
+
+| Component | Description |
+|-----------|-------------|
+| **Plugin Group** | The functional group to which the plugin belongs, for example:<br>- `fastdeploy.model_register_plugins`: for model registration<br>- `fastdeploy.model_runner_plugins`: for model runner registration<br>Users can customize groups as needed. |
+| **Plugin Name** | The unique identifier for each plugin (e.g., `fd_add_dummy_model`), which can be controlled via the `FD_PLUGINS` environment variable to determine whether to load the plugin. |
+| **Plugin Value** | Format is `module_name:function_name`, pointing to the entry function that executes the registration logic. |
+
+## Controlling Plugin Loading Behavior
+
+By default, FastDeploy loads all registered plugins. To load only specific plugins, you can set the environment variable:
+
+```bash
+export FD_PLUGINS=fastdeploy-plugins
+```
+
+Multiple plugin names can be separated by commas:
+
+```bash
+export FD_PLUGINS=plugin_a,plugin_b
+```
+
+## Reference Example
+
+Please refer to the example plugin implementation in the project directory:
+```
+./test/plugins/
+```
+
+It contains a complete plugin structure and `setup.py` configuration example.
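+
+A hypothetical end-to-end walkthrough (the directory, distribution name, and model path below are assumptions based on the examples above):
+
+```bash
+# Install the example plugin package so that its entry points are registered
+cd test/plugins
+pip install .
+
+# Optionally restrict loading to specific plugins (see "Controlling Plugin Loading Behavior")
+export FD_PLUGINS=fastdeploy-plugins
+
+# Launch FastDeploy as usual; registered entry functions run during startup
+python -m fastdeploy.entrypoints.openai.api_server --model /path/to/your/custom/model
+```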
+
+## Summary
+
+Through the plugin mechanism, users can easily add custom models or functional modules to FastDeploy without modifying the core source code. This not only enhances system extensibility but also facilitates third-party developers in extending functionality.
+
+For further plugin development, please refer to the `model_registry` and `plugin_loader` modules in the FastDeploy source code.
diff --git a/docs/get_started/ernie-4.5-vl.md b/docs/get_started/ernie-4.5-vl.md
index 71b0626ae6..015fc6e5af 100644
--- a/docs/get_started/ernie-4.5-vl.md
+++ b/docs/get_started/ernie-4.5-vl.md
@@ -23,6 +23,7 @@ Execute the following command to start the service. For parameter configurations
>💡 **Note**: Since the model parameter size is 424B-A47B, on an 80G * 8 GPU machine, specify ```--quantization wint4``` (wint8 is also supported).
```shell
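+# Enable the new KV cache scheduler introduced in FastDeploy v2.1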
+export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
@@ -31,7 +32,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
- --enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl
diff --git a/docs/get_started/installation/Enflame_gcu.md b/docs/get_started/installation/Enflame_gcu.md
index e443a7ce3a..46d7f0d845 100644
--- a/docs/get_started/installation/Enflame_gcu.md
+++ b/docs/get_started/installation/Enflame_gcu.md
@@ -53,24 +53,21 @@ After driver installation, **re-enter the Docker container**:
docker start paddle-gcu-llm
docker exec -it paddle-gcu-llm bash
```
-5. Install PaddlePaddle & PaddleCustomDevice
+5. Install PaddlePaddle
```bash
# PaddlePaddle Deep Learning Framework provides fundamental computing capabilities
-python -m pip install paddlepaddle==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
-
+python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+```
+6. Install PaddleCustomDevice
+```bash
# PaddleCustomDevice implements custom hardware backend for PaddlePaddle, providing GCU operator implementations
-python -m pip install paddle-custom-gcu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
+python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
# For source compilation, refer to: https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
```
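+
+A quick sanity check before moving on (a minimal sketch; `paddle.device.get_all_custom_device_type()` is assumed to be available in this PaddlePaddle build and to list the GCU backend once `paddle-custom-gcu` is installed):
+```bash
+python -c "import paddle; print(paddle.__version__)"
+python -c "import paddle; print(paddle.device.get_all_custom_device_type())"
+```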
-For latest paddle verion on iluvatar. Refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)
-
-6. Install FastDeploy and dependencies
+7. Install FastDeploy and dependencies
```bash
python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
-```
-
-You can build FastDeploy from source if you need the ```latest version```.
-```bash
+# For source compilation, refer to the following steps
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
python -m pip install -r requirements.txt --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
diff --git a/docs/get_started/installation/iluvatar_gpu.md b/docs/get_started/installation/iluvatar_gpu.md
index e6ca5d3f36..658d9605f3 100644
--- a/docs/get_started/installation/iluvatar_gpu.md
+++ b/docs/get_started/installation/iluvatar_gpu.md
@@ -1,12 +1,12 @@
# Run ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B model on iluvatar machine
-The current version of the software merely serves as a demonstration demo for the Iluvatar CoreX combined with the Fastdeploy inference framework for large models. There may be issues when running the latest ERNIE4.5 model, and we will conduct repairs and performance optimization in the future. Subsequent versions will provide customers with a more stable version.
+The current version of the software only serves as a demonstration of the Iluvatar CoreX combined with the FastDeploy inference framework for large models. Running the latest ERNIE4.5 300B model on the GSM8K dataset takes about 6.3 hours.
## Machine Preparation
-First, you need to prepare a machine with the following configurations:
+First, running the ERNIE4.5 300B model requires `TP=16`, so you need to prepare a machine with the following configuration:
| CPU | Memory | Card | Hard Disk|
| :---: | :---: | :---: | :---: |
-| x86 | 1TB| 8xBI150| 1TB|
+| x86 | 1TB| 16xBI150| 1TB|
Currently, the entire model needs to be loaded into the host memory, which requires more than 600GB of host memory. This issue will be optimized in subsequent versions.
@@ -18,7 +18,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
```
## Container Preparation
-### Start Container
+1. Start Container
```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
@@ -27,26 +27,13 @@ docker exec -it paddle_infer bash
/home/paddle contains the model files, *.whl packages, and scripts.
-### Install paddle
+2. Install packages
```bash
-pip3 install paddlepaddle==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
-pip3 install paddle-iluvatar-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
-```
-For latest paddle verion on iluvatar. Refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)
-
-### Install or build FastDeploy
-```bash
+pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
pip3 install fastdeploy_iluvatar_gpu==2.1.0.dev0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
```
-You can build FastDeploy from source if you need the ```latest version```.
-```bash
-git clone https://github.com/PaddlePaddle/FastDeploy
-cd FastDeploy
-pip install -r requirements_iluvatar.txt
-export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
-bash build.sh
-```
## Prepare the inference demo script
@@ -59,6 +46,7 @@ script list below:
export PADDLE_XCCL_BACKEND=iluvatar_gpu
export INFERENCE_MSG_QUEUE_ID=232132
export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
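+# Use the rejection sampling implementation for token sampling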
+export FD_SAMPLING_CLASS=rejection
export FD_DEBUG=1
python3 run_demo.py
```
@@ -77,7 +65,7 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
# load the model
-llm = LLM(model="/home/paddle/ernie-4_5-21b-a3b-bf16-paddle", tensor_parallel_size=4, max_model_len=8192, static_decode_blocks=0, quantization='wint8')
+llm = LLM(model="/home/paddle/ernie-4_5-21b-a3b-bf16-paddle", tensor_parallel_size=4, max_model_len=8192, static_decode_blocks=0, block_size=16, quantization='wint8')
# Perform batch inference
outputs = llm.generate(prompts, sampling_params)
@@ -131,3 +119,281 @@ Now, let's break down each step:
**Step 3: Drawing the
The largest ocean is the Pacific Ocean, covering an area of approximately … [3], The first scientific expeditions to determine the ocean's depth were the Challenger expedition (1872–1876) and the U.S. Navy Hydrographic Office survey (1877–1879). The oceanic crust is thin and irregular, consisting of upward moving magma from the mantle below, and cooling and solidifying on the surface. The shallowest parts of the ocean are called the continental shelves. Large tides are caused mainly by the alignment of the Sun, Moon, and Earth during new or full moons. The origin of the word "ocean" is not clear. The first global oceanic topography survey was completed by the Challenger expedition (1872–1876). [57] The sound speed in the ocean is primarily a function of water temperature and salinity, and varies with depth. The deep-ocean floor is mostly flat and devoid of life, with the exception of seamounts and various underwater volcanic features, including seamounts and hydrothermal vents. [73] Today, the five ocean
```
+
+## Run the ERNIE4.5 300B model with the GSM8K dataset
+
+1. Download GSM8K dataset
+
+```bash
+wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
+```
+
+2. Prepare `bench_gsm8k.py`
+
+```python
+# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" Fastdeploy + ERNIE-4.5-Turbo 的指标评估 """
+# adapted from https://github.com/sgl-project/sglang/blob/main/benchmark/gsm8k/bench_other.py
+import argparse
+import ast
+import json
+import re
+import time
+from concurrent.futures import ThreadPoolExecutor
+
+import numpy as np
+import requests
+from tqdm import tqdm
+
+INVALID = -9999999
+
+
+def call_generate(prompt, **kwargs):
+ """
+ Generates response based on the input prompt.
+
+ Args:
+ prompt (str): The input prompt text.
+ **kwargs: Keyword arguments, including server IP address and port number.
+
+ Returns:
+ str: The response generated based on the prompt.
+
+ """
+ url = f"http://{kwargs['ip']}:{kwargs['port']}/v1/chat/completions"
+ headers = {"Content-Type": "application/json"}
+ data = {
+ "messages": [
+ {
+ "role": "user",
+ "content": prompt,
+ }
+ ],
+ "temperature": 0.6,
+ "max_tokens": 2047,
+ "top_p": 0.95,
+ "do_sample": True,
+ }
+
+ response = requests.post(url, headers=headers, data=json.dumps(data))
+ out = response.json()
+ return out["choices"][0]["message"]["content"]
+
+
+def get_one_example(lines, i, include_answer):
+ """
+ Retrieves a question-answer example from the given list of text lines.
+
+ Args:
+ lines (list of dict): A list of question-answer pairs.
+ i (int): The index of the question-answer pair to retrieve from lines.
+ include_answer (bool): Whether to include the answer in the returned string.
+
+ Returns:
+ str: A formatted question-answer string in the format "Question: \nAnswer: ".
+
+ """
+ ret = "Question: " + lines[i]["question"] + "\nAnswer:"
+ if include_answer:
+ ret += " " + lines[i]["answer"]
+ return ret
+
+
+def get_few_shot_examples(lines, k):
+ """
+ Selects k examples from the given list of text lines and concatenates them into a single string.
+
+ Args:
+ lines (list): A list containing text lines.
+ k (int): The number of examples to select.
+
+ Returns:
+ str: A string composed of k examples, separated by two newline characters.
+ """
+ ret = ""
+ for i in range(k):
+ ret += get_one_example(lines, i, True) + "\n\n"
+ return ret
+
+
+def get_answer_value(answer_str):
+ """
+ Extracts numerical values from an answer string and returns them.
+
+ Args:
+ answer_str (str): The string containing the answer.
+
+ Returns:
+ The extracted numerical value; returns "INVALID" if extraction fails.
+ """
+ answer_str = answer_str.replace(",", "")
+ numbers = re.findall(r"\d+", answer_str)
+ if len(numbers) < 1:
+ return INVALID
+ try:
+ return ast.literal_eval(numbers[-1])
+ except SyntaxError:
+ return INVALID
+
+
+def read_jsonl(filename: str):
+ """
+ Reads a JSONL file.
+
+ Args:
+ filename (str): Path to the JSONL file.
+
+ Yields:
+ dict: A dictionary object corresponding to each line in the JSONL file.
+ """
+ with open(filename) as fin:
+ for line in fin:
+ if line.startswith("#"):
+ continue
+ yield json.loads(line)
+
+
+def main(args):
+ """
+ Process inputs and generate answers by calling the model in parallel using a thread pool.
+
+ Args:
+ args (argparse.Namespace):
+ - num_questions (int): Number of questions to process.
+ - num_shots (int): Number of few-shot learning examples.
+ - ip (str): IP address of the model service.
+ - port (int): Port number of the model service.
+ - parallel (int): Number of questions to process in parallel.
+ - result_file (str): File path to store the results.
+
+ Returns:
+ None
+
+ """
+ # Read data
+ filename = "test.jsonl"
+
+ lines = list(read_jsonl(filename))
+
+ # Construct prompts
+ num_questions = args.num_questions
+ num_shots = args.num_shots
+ few_shot_examples = get_few_shot_examples(lines, num_shots)
+
+ questions = []
+ labels = []
+ for i in range(len(lines[:num_questions])):
+ questions.append(get_one_example(lines, i, False))
+ labels.append(get_answer_value(lines[i]["answer"]))
+ assert all(l != INVALID for l in labels)
+
+ states = [None] * len(labels)
+
+ # Use thread pool
+ def get_one_answer(i):
+ answer = call_generate(
+ prompt=few_shot_examples + questions[i],
+ # stop=["Question", "Assistant:", "<|separator|>"],
+ ip=args.ip,
+ port=args.port,
+ )
+ states[i] = answer
+
+ tic = time.time()
+ if args.parallel == 1:
+ for i in tqdm(range(len(questions))):
+ get_one_answer(i)
+ else:
+ with ThreadPoolExecutor(args.parallel) as executor:
+ list(
+ tqdm(
+ executor.map(get_one_answer, list(range(len(questions)))),
+ total=len(questions),
+ )
+ )
+
+ latency = time.time() - tic
+ preds = []
+ for i in range(len(states)):
+ preds.append(get_answer_value(states[i]))
+
+ # Compute accuracy
+ acc = np.mean(np.array(preds) == np.array(labels))
+ invalid = np.mean(np.array(preds) == INVALID)
+
+ # Print results
+ print(f"Accuracy: {acc:.3f}")
+ print(f"Invalid: {invalid:.3f}")
+ print(f"Latency: {latency:.3f} s")
+
+ with open(args.result_file, "a") as fout:
+ value = {
+ "task": "gsm8k",
+ "backend": "paddlepaddle",
+ "num_gpus": 1,
+ "latency": round(latency, 3),
+ "accuracy": round(acc, 3),
+ "num_requests": args.num_questions,
+ "other": {
+ "num_questions": args.num_questions,
+ "parallel": args.parallel,
+ },
+ }
+ fout.write(json.dumps(value) + "\n")
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--ip", type=str, default="127.0.0.1")
+ parser.add_argument("--port", type=str, default="8188")
+ parser.add_argument("--num-shots", type=int, default=10)
+ parser.add_argument("--data-path", type=str, default="test.jsonl")
+ parser.add_argument("--num-questions", type=int, default=1319)
+ parser.add_argument("--result-file", type=str, default="result.jsonl")
+ parser.add_argument("--parallel", type=int, default=1)
+ args = parser.parse_args()
+ main(args)
+```
+
+3. Prepare `run_bench.sh`
+
+```bash
+#!/bin/bash
+export PADDLE_XCCL_BACKEND=iluvatar_gpu
+export INFERENCE_MSG_QUEUE_ID=232132
+export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
+export FD_SAMPLING_CLASS=rejection
+
+python3 -m fastdeploy.entrypoints.openai.api_server --model "/home/paddle/ernie-45t" --port 8188 --tensor-parallel-size 16 --block-size 16 --static-decode-blocks 0 --quantization wint8
+```
+
+4. Running the Script
+
+First, open a terminal and run:
+```bash
+./run_bench.sh
+```
+After the service is ready, open another terminal and run:
+```bash
+python3 -u bench_gsm8k.py --port 8188 --num-questions 1319 --num-shots 5 --parallel 8
+```
+It takes about 6.3 hours to run the GSM8K dataset.
+
+```
+Accuracy: 0.964
+Invalid: 0.000
+Latency: 22918.186 s
+```
diff --git a/docs/get_started/installation/kunlunxin_xpu.md b/docs/get_started/installation/kunlunxin_xpu.md
index 4950347ce1..aeaae3bac6 100644
--- a/docs/get_started/installation/kunlunxin_xpu.md
+++ b/docs/get_started/installation/kunlunxin_xpu.md
@@ -6,7 +6,7 @@
- Python: 3.10
- XPU Model: P800
- XPU Driver Version: ≥ 5.0.21.26
-- XPU Firmware Version: ≥ 1.31
+- XPU Firmware Version: ≥ 1.48
Verified platform:
- CPU: INTEL(R) XEON(R) PLATINUM 8563C / Hygon C86-4G 7490 64-core Processor
@@ -16,7 +16,7 @@ Verified platform:
- Python: 3.10
- XPU Model: P800 (OAM Edition)
- XPU Driver Version: 5.0.21.26
-- XPU Firmware Version: 1.31
+- XPU Firmware Version: 1.48
**Note:** Currently, only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified. Other CPU types and P800 (PCIe Edition) servers have not been tested yet.
diff --git a/docs/get_started/installation/nvidia_gpu.md b/docs/get_started/installation/nvidia_gpu.md
index 97e3dc7503..381a2de0a2 100644
--- a/docs/get_started/installation/nvidia_gpu.md
+++ b/docs/get_started/installation/nvidia_gpu.md
@@ -13,14 +13,14 @@ The following installation methods are available when your environment meets the
**Notice**: The pre-built image only supports SM80/90 GPUs (e.g. H800/A800). If you are deploying on SM86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.
```shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.0
```
## 2. Pre-built Pip Installation
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
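+
+Optionally verify the installation before continuing; `paddle.utils.run_check()` is PaddlePaddle's built-in self check:
+```shell
+python -c "import paddle; paddle.utils.run_check()"
+```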
Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
@@ -58,7 +58,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Then clone the source code and build:
diff --git a/docs/get_started/quick_start_vl.md b/docs/get_started/quick_start_vl.md
index 83b1b97d7d..b9c50a1c26 100644
--- a/docs/get_started/quick_start_vl.md
+++ b/docs/get_started/quick_start_vl.md
@@ -19,6 +19,7 @@ For more information about how to install FastDeploy, refer to the [installation
After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -26,8 +27,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
- --reasoning-parser ernie-45-vl \
- --enable-mm
+ --reasoning-parser ernie-45-vl
```
> 💡 Note: In the path specified by ```--model```, if the subdirectory corresponding to the path does not exist in the current directory, it will try to query whether AIStudio has a preset model based on the specified model name (such as ```baidu/ERNIE-4.5-0.3B-Base-Paddle```). If it exists, it will automatically start downloading. The default download path is: ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
diff --git a/docs/index.md b/docs/index.md
index 1149811acd..b1e3c336fd 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -13,12 +13,12 @@
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅(WINT4)| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|✅(WINT4)| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
+|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅| 128K |
## Documentation
diff --git a/docs/offline_inference.md b/docs/offline_inference.md
index 3bb52a1911..31f79b7490 100644
--- a/docs/offline_inference.md
+++ b/docs/offline_inference.md
@@ -39,7 +39,7 @@ Documentation for `SamplingParams`, `LLM.generate`, `LLM.chat`, and output struc
```python
from fastdeploy.entrypoints.llm import LLM
# Load the model
-llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.chat(
messages=[
@@ -127,7 +127,7 @@ for message in messages:
})
sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
-llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.generate(prompts={
"prompt": prompt,
"multimodal_data": {
diff --git a/docs/parameters.md b/docs/parameters.md
index c52fc9ac6f..28a66b72c9 100644
--- a/docs/parameters.md
+++ b/docs/parameters.md
@@ -8,6 +8,8 @@ When using FastDeploy to deploy models (including offline inference and service
|:--------------|:----|:-----------|
| ```port``` | `int` | Only required for service deployment, HTTP service port number, default: 8000 |
| ```metrics_port``` | `int` | Only required for service deployment, metrics monitoring port number, default: 8001 |
+| ```max_waiting_time``` | `int` | Only required for service deployment, maximum wait time for establishing a connection upon service request. Default: -1 (indicates no wait time limit).|
| ```max_concurrency``` | `int` | Only required for service deployment, the maximum number of concurrent connections the service will establish, default: 512 |
| ```engine_worker_queue_port``` | `int` | FastDeploy internal engine communication port, default: 8002 |
| ```cache_queue_port``` | `int` | FastDeploy internal KVCache process communication port, default: 8003 |
| ```max_model_len``` | `int` | Default maximum supported context length for inference, default: 2048 |
@@ -19,7 +21,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```tokenizer``` | `str` | Tokenizer name or path, defaults to model path |
| ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, enabled by default when automatically calculating KV Cache |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
-| ```enable_mm``` | `bool` | Whether to support multimodal data (for multimodal models only), default: False |
+| ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), default: False |
| ```quantization``` | `str` | Model quantization strategy, when loading BF16 CKPT, specifying wint4 or wint8 supports lossless online 4bit/8bit quantization |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks, this parameter can be automatically calculated by FastDeploy based on memory situation, no need for user configuration, default: None |
diff --git a/docs/usage/environment_variables.md b/docs/usage/environment_variables.md
index a8f3ac17b2..31f895370f 100644
--- a/docs/usage/environment_variables.md
+++ b/docs/usage/environment_variables.md
@@ -38,7 +38,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
# Whether to use HuggingFace tokenizer (0 or 1)
"FD_USE_HF_TOKENIZER":
- lambda: os.getenv("FD_USE_HF_TOKENIZER", 0),
+ lambda: bool(int(os.getenv("FD_USE_HF_TOKENIZER", 0))),
# ZMQ send high-water mark (HWM) during initialization
"FD_ZMQ_SNDHWM":
diff --git a/docs/usage/kunlunxin_xpu_deployment.md b/docs/usage/kunlunxin_xpu_deployment.md
index 455152d59c..1096db3399 100644
--- a/docs/usage/kunlunxin_xpu_deployment.md
+++ b/docs/usage/kunlunxin_xpu_deployment.md
@@ -2,9 +2,9 @@
|Model Name|Context Length|Quantization|XPUs Required|Deployment Commands|Minimum Version Required|
|-|-|-|-|-|-|
|ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint8" \
--gpu-memory-utilization 0.9|>=2.0.3|
-|ERNIE-4.5-300B-A47B|32K|WINT4|4 (recommend)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
-|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
-|ERNIE-4.5-300B-A47B|128K|WINT4|8 (recommend)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
+|ERNIE-4.5-300B-A47B|32K|WINT4|4 (Recommended)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
+|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.95|>=2.0.0|
+|ERNIE-4.5-300B-A47B|128K|WINT4|8 (Recommended)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \
--port 8188 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \
--port 8188 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 128 \
--quantization "wint8" \
--gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|32K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \
--port 8188 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 128 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.1.0|
@@ -89,4 +89,4 @@ for chunk in response:
print('\n')
```
-For detailed OpenAI protocol specifications, see [OpenAI Chat Compeltion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [OpenAI Protocol-Compatible API Server](../../online_serving/README.md).
+For detailed OpenAI protocol specifications, see [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [OpenAI Protocol-Compatible API Server](../online_serving/README.md).
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
similarity index 99%
rename from docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md
rename to docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
index a263bcc525..4b9eb3343a 100644
--- a/docs/zh/optimal_deployment/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -2,6 +2,7 @@
## 一、环境准备
### 1.1 支持情况
ERNIE-4.5-0.3B 各量化精度,在下列硬件上部署所需要的最小卡数如下:
+
| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
similarity index 97%
rename from docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md
rename to docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index d5a26df3d6..db4985cc75 100644
--- a/docs/zh/optimal_deployment/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -2,6 +2,7 @@
## 一、环境准备
### 1.1 支持情况
ERNIE-4.5-21B-A3B 各量化精度,在下列硬件上部署所需要的最小卡数如下:
+
| | WINT8 | WINT4 | FP8 |
|-----|-----|-----|-----|
|H800 80GB| 1 | 1 | 1 |
@@ -110,7 +111,6 @@ export INFERENCE_MSG_QUEUE_ID=1315
export FLAGS_max_partition_size=2048
export FD_ATTENTION_BACKEND=FLASH_ATTN
export FD_LOG_DIR="prefill_log"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -120,7 +120,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-num-seqs 20 \
--num-gpu-blocks-override 40000 \
--quantization ${quant_type} \
- --gpu-memory-utilization 0.9 \
+ --gpu-memory-utilization 0.9 --kv-cache-ratio 0.9 \
--port 7012 --engine-worker-queue-port 7013 --metrics-port 7014 --tensor-parallel-size 4 \
--cache-queue-port 7015 \
--splitwise-role "prefill" \
@@ -131,7 +131,6 @@ export CUDA_VISIBLE_DEVICES=4,5,6,7
export INFERENCE_MSG_QUEUE_ID=1215
export FLAGS_max_partition_size=2048
export FD_LOG_DIR="decode_log"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
quant_type=block_wise_fp8
export FD_USE_DEEP_GEMM=0
@@ -140,7 +139,7 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
--max-model-len 131072 \
--max-num-seqs 20 \
--quantization ${quant_type} \
- --gpu-memory-utilization 0.85 \
+ --gpu-memory-utilization 0.85 --kv-cache-ratio 0.1 \
--port 9012 --engine-worker-queue-port 8013 --metrics-port 8014 --tensor-parallel-size 4 \
--cache-queue-port 8015 \
--innode-prefill-ports 7013 \
diff --git a/docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
similarity index 98%
rename from docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md
rename to docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index 95fbe3e387..b6ef33a19a 100644
--- a/docs/zh/optimal_deployment/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -2,6 +2,7 @@
## 一、环境准备
### 1.1 支持情况
ERNIE-4.5-300B-A47B各量化精度,在下列硬件上部署所需要的最小卡数如下:
+
| | WINT8 | WINT4 | FP8 | WINT2 | W4A8 |
|-----|-----|-----|-----|-----|-----|
|H800 80GB| 8 | 4 | 8 | 2 | 4 |
@@ -99,7 +100,6 @@ export FD_SAMPLING_CLASS=rejection
**启用方式:** 以单机8GPU,1P1D(各4GPU)部署为例,与默认的混合式部署方式相比, 需要`--splitwise-role`指定节点的角色。并通过环境变量`FD_LOG_DIR`和`CUDA_VISIBLE_DEVICES`将两个节点的GPU 和日志隔离开
```
export FD_LOG_DIR="log_prefill"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
@@ -112,7 +112,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
```
```
export FD_LOG_DIR="log_decode"
-export ENABLE_V1_KVCACHE_SCHEDULER=1
export CUDA_VISIBLE_DEVICES=4,5,6,7
# 注意innode-prefill-ports指定为Prefill服务的engine-worker-queue-port
python -m fastdeploy.entrypoints.openai.api_server \
diff --git a/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
new file mode 100644
index 0000000000..12ebb26965
--- /dev/null
+++ b/docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -0,0 +1,134 @@
+
+# ERNIE-4.5-VL-28B-A3B-Paddle
+
+## 一、环境准备
+### 1.1 支持情况
+在下列硬件上部署所需要的最小卡数如下:
+
+| 设备[显存] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| A30 [24G] | 2 | 2 | 4 |
+| L20 [48G] | 1 | 1 | 2 |
+| H20 [144G] | 1 | 1 | 1 |
+| A100 [80G] | 1 | 1 | 1 |
+| H800 [80G] | 1 | 1 | 1 |
+
+### 1.2 安装fastdeploy
+
+安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)
+
+> ⚠️ 注意事项
+> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型
+> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径
+
+## 二、如何使用
+### 2.1 基础:启动服务
+ **示例1:** 4090上单卡部署32K上下文的服务
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 1 \
+ --max-model-len 32768 \
+ --max-num-seqs 32 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 384 \
+ --quantization wint4 \
+ --enable-mm
+```
+ **示例2:** H800上双卡部署128K上下文的服务
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 2 \
+ --max-model-len 131072 \
+ --max-num-seqs 128 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.9 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 384 \
+ --quantization wint4 \
+ --enable-mm
+```
+> ⚠️ 2.1及以上版本需要通过环境变量 `ENABLE_V1_KVCACHE_SCHEDULER=1` 开启新调度器,否则部分请求可能在达到最大长度前被截断或返回空结果。
+
+示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
+如果对精度、性能有进一步的要求,请继续阅读下面的内容。
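+在进入进阶配置前,可以先用下面的请求示例验证服务是否正常(示意代码:假设服务按上文示例监听在本机 8180 端口,`model` 字段、`api_key` 与图片 URL 均为占位,请按实际情况替换):
+```python
+# 依赖 OpenAI Python SDK:pip install openai
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8180/v1", api_key="null")  # api_key 任意填写即可(占位)
+response = client.chat.completions.create(
+    model="default",  # 占位名称,按服务实际配置填写
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
+            {"type": "text", "text": "请描述一下这张图片的内容"},
+        ],
+    }],
+)
+print(response.choices[0].message.content)
+```
+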
+### 2.2 进阶:如何获取更优性能
+
+#### 2.2.1 评估应用场景,正确设置参数
+> **上下文长度**
+- **参数:** `--max-model-len`
+- **描述:** 控制模型可处理的最大上下文长度。
+- **推荐:** 更长的上下文会导致吞吐降低,根据实际情况设置,`ERNIE-4.5-VL-28B-A3B-Paddle`最长支持**128k**(131072)长度的上下文。
+
+ ⚠️ 注:更长的上下文会显著增加GPU显存需求,设置前请确保硬件资源充足。
+> **最大序列数量**
+- **参数:** `--max-num-seqs`
+- **描述:** 控制服务可以处理的最大序列数量,支持1~256。
+- **推荐:** 如果您不知道实际应用场景中请求的平均序列数量是多少,我们建议设置为**256**。如果您的应用场景中请求的平均序列数量明显少于256,我们建议设置为一个略大于平均值的较小值,以进一步降低显存占用,优化服务性能。
+
+> **多图、多视频输入**
+- **参数**:`--limit-mm-per-prompt`
+- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。
+- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。
+
+> **初始化时可用的显存比例**
+- **参数:** `--gpu-memory-utilization`
+- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。
+- **推荐:** 推荐使用默认值0.9。如果服务压测时提示显存不足,可以尝试调低该值。
+
+#### 2.2.2 Chunked Prefill
+- **参数:** `--enable-chunked-prefill`
+- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。
+
+- **其他相关配置**:
+
+ `--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。我们推荐设置为384。
+
+#### 2.2.3 **量化精度**
+- **参数:** `--quantization`
+
+- **已支持的精度类型:**
+ - WINT4 (适合大多数用户)
+ - WINT8
+ - BFLOAT16 (未设置 `--quantization` 参数时,默认使用BFLOAT16)
+
+- **推荐:**
+ - 除非您有极其严格的精度要求,否则我们建议使用WINT4量化。这将显著降低内存占用并提升吞吐量。
+ - 若需要稍高的精度,可尝试WINT8。
+ - 仅当您的应用场景对精度有极致要求时候才尝试使用BFLOAT16,因为它需要更多显存。
+
+#### 2.2.4 **可调整的环境变量**
+> **拒绝采样:**`FD_SAMPLING_CLASS=rejection`
+- **描述**:拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,可以提升推理性能。
+- **推荐**:这是一种较为激进、可能影响效果的优化策略,我们仍在全面验证其影响。如果对性能有较高要求,且可以接受对效果的潜在影响,可以尝试开启。
+
+> **Attention超参:**`FLAGS_max_partition_size=1024`
+- **描述**:Append Attention(默认)后端的超参,我们在常用数据集上的测试结果表明,设置为1024后可以大幅提升解码速度,尤其是长文场景。
+- **推荐**:未来会修改为自动调整的机制。如果对性能有较高要求可以尝试开启。
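+
+上述两个环境变量的设置方式可参考下面的示意代码(以离线推理为例,`LLM` 参数沿用本文的模型配置,仅供参考;服务化部署时在启动命令前 `export` 同名变量即可):
+```python
+import os
+
+# 环境变量建议在导入 fastdeploy 之前设置(示意)
+os.environ["FD_SAMPLING_CLASS"] = "rejection"    # 拒绝采样,可能影响效果,请谨慎开启
+os.environ["FLAGS_max_partition_size"] = "1024"  # Append Attention 后端超参
+
+from fastdeploy.entrypoints.llm import LLM
+
+llm = LLM(
+    model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
+    tensor_parallel_size=1,
+    max_model_len=32768,
+    limit_mm_per_prompt={"image": 100},
+    reasoning_parser="ernie-45-vl",
+)
+```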
+
+## 三、常见问题FAQ
+**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。
+
+### 3.1 显存不足(OOM)
+如果服务启动时提示显存不足,请尝试以下方法:
+1. 确保无其他进程占用显卡显存;
+2. 使用WINT4/WINT8量化,开启chunked prefill;
+3. 酌情降低上下文长度和最大序列数量;
+4. 增加部署卡数,使用2卡或4卡部署,即修改参数 `--tensor-parallel-size 2` 或 `--tensor-parallel-size 4`。
+
+如果服务可以正常启动,但运行时提示显存不足,请尝试以下方法:
+1. 酌情降低初始化时可用的显存比例,即调整参数 `--gpu-memory-utilization` 的值;
+2. 增加部署卡数,参数修改同上。
diff --git a/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
new file mode 100644
index 0000000000..bb83c02fe4
--- /dev/null
+++ b/docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -0,0 +1,109 @@
+
+# ERNIE-4.5-VL-424B-A47B-Paddle
+
+## 一、环境准备
+### 1.1 支持情况
+在下列硬件上部署所需要的最小卡数如下:
+
+| 设备[显存] | WINT4 | WINT8 | BFLOAT16 |
+|:----------:|:----------:|:------:| :------:|
+| H20 [144G] | 8 | 8 | 8 |
+| A100 [80G] | 8 | 8 | - |
+| H800 [80G] | 8 | 8 | - |
+
+### 1.2 安装fastdeploy
+
+安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)
+
+> ⚠️ 注意事项
+> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型
+> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径
+
+## 二、如何使用
+### 2.1 基础:启动服务
+ **示例1:** H800上8卡部署128K上下文的服务
+```shell
+python -m fastdeploy.entrypoints.openai.api_server \
+ --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
+ --port 8180 \
+ --metrics-port 8181 \
+ --engine-worker-queue-port 8182 \
+ --tensor-parallel-size 8 \
+ --max-model-len 131072 \
+ --max-num-seqs 16 \
+ --limit-mm-per-prompt '{"image": 100, "video": 100}' \
+ --reasoning-parser ernie-45-vl \
+ --gpu-memory-utilization 0.8 \
+ --enable-chunked-prefill \
+ --max-num-batched-tokens 384 \
+ --quantization wint4 \
+ --enable-mm
+```
+> ⚠️ 2.1及以上版本需要通过环境变量 `ENABLE_V1_KVCACHE_SCHEDULER=1` 开启新调度器,否则部分请求可能在达到最大长度前被截断或返回空结果。
+
+示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
+如果对精度、性能有进一步的要求,请继续阅读下面的内容。
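+在进入进阶配置前,可以先用下面的流式请求示例做快速验证(示意代码:假设服务按上例监听在本机 8180 端口,`model` 字段与 `api_key` 均为占位):
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8180/v1", api_key="null")  # api_key 任意填写即可(占位)
+stream = client.chat.completions.create(
+    model="default",  # 占位名称,按服务实际配置填写
+    messages=[{"role": "user", "content": "介绍一下你自己"}],
+    stream=True,
+)
+for chunk in stream:
+    # 流式返回中 delta.content 可能为 None,需要判空
+    print(chunk.choices[0].delta.content or "", end="", flush=True)
+print()
+```
+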
+### 2.2 进阶:如何获取更优性能
+
+#### 2.2.1 评估应用场景,正确设置参数
+> **上下文长度**
+- **参数:** `--max-model-len`
+- **描述:** 控制模型可处理的最大上下文长度。
+- **推荐:** 更长的上下文会导致吞吐降低,根据实际情况设置,`ERNIE-4.5-VL-424B-A47B-Paddle` 最长支持**128k**(131072)长度的上下文。
+
+> **最大序列数量**
+- **参数:** `--max-num-seqs`
+- **描述:** 控制服务可以处理的最大序列数量,支持1~256。
+- **推荐:** 128k场景下,80G显存的单机我们建议设置为**16**。
+
+> **多图、多视频输入**
+- **参数**:`--limit-mm-per-prompt`
+- **描述**:我们的模型支持单次提示词(prompt)中输入多张图片和视频。请使用此参数限制每次请求的图片/视频数量,以确保资源高效利用。
+- **推荐**:我们建议将单次提示词(prompt)中的图片和视频数量均设置为100个,以平衡性能与内存占用。
+
+> **初始化时可用的显存比例**
+- **参数:** `--gpu-memory-utilization`
+- **用处:** 用于控制 FastDeploy 初始化服务的可用显存,默认0.9,即预留10%的显存备用。
+- **推荐:** 128k长度的上下文时推荐使用0.8。如果服务压测时提示显存不足,可以尝试调低该值。
+
+#### 2.2.2 Chunked Prefill
+- **参数:** `--enable-chunked-prefill`
+- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**。
+
+- **其他相关配置**:
+
+ `--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。推荐设置为384。
+
+#### 2.2.3 **量化精度**
+- **参数:** `--quantization`
+
+- **已支持的精度类型:**
+ - WINT4 (适合大多数用户)
+ - WINT8
+ - BFLOAT16 (未设置 `--quantization` 参数时,默认使用BFLOAT16)
+
+- **推荐:**
+ - 除非您有极其严格的精度要求,否则我们建议使用WINT4量化。这将显著降低内存占用并提升吞吐量。
+ - 若需要稍高的精度,可尝试WINT8。
+ - 仅当您的应用场景对精度有极致要求时候才尝试使用BFLOAT16,因为它需要更多显存。
+
+#### 2.2.4 **可调整的环境变量**
+> **拒绝采样:**`FD_SAMPLING_CLASS=rejection`
+- **描述**:拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,可以提升推理性能。
+- **推荐**:这是一种较为激进、可能影响效果的优化策略,我们仍在全面验证其影响。如果对性能有较高要求,且可以接受对效果的潜在影响,可以尝试开启。
+
+> **Attention超参:**`FLAGS_max_partition_size=1024`
+- **描述**:Append Attention(默认)后端的超参,我们在常用数据集上的测试结果表明,设置为1024后可以大幅提升解码速度,尤其是长文场景。
+- **推荐**:未来会修改为自动调整的机制。如果对性能有较高要求可以尝试开启。
+
+## 三、常见问题FAQ
+**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。
+
+### 3.1 显存不足(OOM)
+如果服务启动时提示显存不足,请尝试以下方法:
+1. 确保无其他进程占用显卡显存;
+2. 使用WINT4/WINT8量化,开启chunked prefill;
+3. 酌情降低上下文长度和最大序列数量。
+
+如果服务可以正常启动,但运行时提示显存不足,请尝试以下方法:
+1. 酌情降低初始化时可用的显存比例,即调整参数 `--gpu-memory-utilization` 的值。
diff --git a/docs/zh/optimal_deployment/FAQ.md b/docs/zh/best_practices/FAQ.md
similarity index 100%
rename from docs/zh/optimal_deployment/FAQ.md
rename to docs/zh/best_practices/FAQ.md
diff --git a/docs/zh/features/plugins.md b/docs/zh/features/plugins.md
new file mode 100644
index 0000000000..040233ef85
--- /dev/null
+++ b/docs/zh/features/plugins.md
@@ -0,0 +1,85 @@
+# FastDeploy 插件机制说明文档
+
+FastDeploy 支持插件机制,允许用户在不修改核心代码的前提下扩展功能。插件通过 Python 的 `entry_points` 机制实现自动发现与加载。
+
+## 插件工作原理
+
+插件本质上是在 FastDeploy 启动时被自动调用的注册函数。系统使用 `load_plugins_by_group` 函数确保所有进程(包括分布式训练场景下的子进程)在正式运行前都已加载所需的插件。
+
+## 插件发现机制
+
+FastDeploy 利用 Python 的 `entry_points` 机制来发现并加载插件。开发者需在自己的项目中将插件注册到指定的 entry point 组中。
+
+### 示例:创建一个插件
+
+#### 1. 编写插件逻辑
+
+假设你有一个自定义模型类 `MyModelForCasualLM` 和预训练类 `MyPretrainedModel`,你可以编写如下注册函数:
+
+```python
+# 文件:fd_add_dummy_model/__init__.py
+from fastdeploy.model_registry import ModelRegistry
+from my_custom_model import MyModelForCasualLM, MyPretrainedModel
+
+def register():
+ if "MyModelForCasualLM" not in ModelRegistry.get_supported_archs():
+ ModelRegistry.register_model_class(MyModelForCasualLM)
+ ModelRegistry.register_pretrained_model(MyPretrainedModel)
+```
+
+#### 2. 注册插件到 `setup.py`
+
+```python
+# setup.py
+from setuptools import setup
+
+setup(
+ name="fastdeploy-plugins",
+ version="0.1",
+ packages=["fd_add_dummy_model"],
+ entry_points={
+ "fastdeploy.model_register_plugins": [
+ "fd_add_dummy_model = fd_add_dummy_model:register",
+ ],
+ },
+)
+```
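+
+插件包安装完成后(例如在插件项目目录执行 `pip install .`),可以用下面的示意代码自检注册逻辑是否可用(`fd_add_dummy_model`、`MyModelForCasualLM` 均沿用上文的假设性示例名称):
+```python
+# 手动触发一次注册逻辑做自检;实际部署中由 FastDeploy 启动时自动调用
+from fd_add_dummy_model import register
+from fastdeploy.model_registry import ModelRegistry
+
+register()
+print("MyModelForCasualLM" in ModelRegistry.get_supported_archs())  # 预期输出 True
+```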
+
+## 插件结构说明
+
+插件由三部分组成:
+
+| 组件 | 说明 |
+|------|------|
+| **插件组(Group)** | 插件所属的功能分组,例如:<br>- `fastdeploy.model_register_plugins`: 用于注册模型<br>- `fastdeploy.model_runner_plugins`: 用于注册模型运行器<br>用户可根据需要自定义分组。 |
+| **插件名(Name)** | 每个插件的唯一标识名(如 `fd_add_dummy_model`),可通过环境变量 `FD_PLUGINS` 控制是否加载该插件。 |
+| **插件值(Value)** | 格式为 `模块名:函数名`,指向实际执行注册逻辑的入口函数。 |
+
+## 控制插件加载行为
+
+默认情况下,FastDeploy 会加载所有已注册的插件。若只想加载特定插件,可以设置环境变量:
+
+```bash
+export FD_PLUGINS=fastdeploy-plugins
+```
+
+多个插件名之间可以用逗号分隔:
+
+```bash
+export FD_PLUGINS=plugin_a,plugin_b
+```
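+
+下面是一段与上述发现与过滤机制等价的示意代码(并非 FastDeploy 的真实实现,仅用于说明 `entry_points` 与 `FD_PLUGINS` 如何配合):
+```python
+import os
+from importlib.metadata import entry_points
+
+def load_plugins(group: str = "fastdeploy.model_register_plugins"):
+    """按 entry point 组发现插件,按 FD_PLUGINS 过滤后依次执行注册函数(示意)。"""
+    allowed = os.getenv("FD_PLUGINS")
+    allowed_names = set(allowed.split(",")) if allowed else None
+    for ep in entry_points(group=group):  # Python 3.10+ 支持 group 关键字
+        if allowed_names is not None and ep.name not in allowed_names:
+            continue
+        register_func = ep.load()  # 解析 "模块名:函数名" 并导入对应函数
+        register_func()            # 执行注册逻辑,例如上文的 register()
+
+if __name__ == "__main__":
+    load_plugins()
+```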
+
+## 参考示例
+
+请参见项目目录下的示例插件实现:
+```
+./test/plugins/
+```
+
+其中包含完整的插件结构和 `setup.py` 配置示例。
+
+## 总结
+
+通过插件机制,用户可以轻松地为 FastDeploy 添加自定义模型或功能模块,而无需修改核心源码。这不仅提升了系统的可扩展性,也方便了第三方开发者进行功能拓展。
+
+如需进一步开发插件,请参考 FastDeploy 源码中的 `model_registry` 和 `plugin_loader` 模块。
diff --git a/docs/zh/get_started/ernie-4.5-vl.md b/docs/zh/get_started/ernie-4.5-vl.md
index 3922c899f9..6fed957d4b 100644
--- a/docs/zh/get_started/ernie-4.5-vl.md
+++ b/docs/zh/get_started/ernie-4.5-vl.md
@@ -23,6 +23,7 @@
**注意**: 由于模型参数量为424B-A47B,在80G * 8卡的机器上,需指定```--quantization wint4```(wint8也可部署)。
```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 --engine-worker-queue-port 8181 \
@@ -31,7 +32,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
--quantization wint4 \
--max-model-len 32768 \
--max-num-seqs 32 \
- --enable-mm \
--mm-processor-kwargs '{"video_max_frames": 30}' \
--limit-mm-per-prompt '{"image": 10, "video": 3}' \
--reasoning-parser ernie-45-vl
diff --git a/docs/zh/get_started/installation/Enflame_gcu.md b/docs/zh/get_started/installation/Enflame_gcu.md
index cc1042e753..b71a97a8a2 100644
--- a/docs/zh/get_started/installation/Enflame_gcu.md
+++ b/docs/zh/get_started/installation/Enflame_gcu.md
@@ -52,24 +52,21 @@ bash TopsRider_i3x_*_deb_amd64.run --driver --no-auto-load -y
docker start paddle-gcu-llm
docker exec -it paddle-gcu-llm bash
```
-5. 安装 PaddlePaddle & PaddleCustomDevice
+5. 安装 PaddlePaddle
```bash
# PaddlePaddle『飞桨』深度学习框架,提供运算基础能力
-python -m pip install paddlepaddle==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
-
+python -m pip install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+```
+6. 安装 PaddleCustomDevice
+```bash
# PaddleCustomDevice是PaddlePaddle『飞桨』深度学习框架的自定义硬件接入实现,提供GCU的算子实现
-python -m pip install paddle-custom-gcu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
+python -m pip install paddle-custom-gcu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/gcu/
# 如想源码编译安装,请参考https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/gcu/README_cn.md
```
-获取Paddle的最新安装版本: [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)
-
-6. 安装 FastDeploy
+7. 安装 FastDeploy
```bash
python -m pip install fastdeploy -i https://www.paddlepaddle.org.cn/packages/stable/gcu/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
-```
-
-可以按如下步骤编译FastDeploy,得到```最新版本```.
-```bash
+# 如想源码编译安装,请参考如下步骤
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
python -m pip install -r requirements.txt --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
diff --git a/docs/zh/get_started/installation/README.md b/docs/zh/get_started/installation/README.md
index 80638604b6..68fdbec52d 100644
--- a/docs/zh/get_started/installation/README.md
+++ b/docs/zh/get_started/installation/README.md
@@ -1,6 +1,6 @@
-# FastDeploy Installation Guide
+# FastDeploy 安装
-FastDeploy currently supports installation on the following hardware platforms:
+FastDeploy支持如下硬件平台:
- [NVIDIA GPU Installation](nvidia_gpu.md)
- [Kunlunxin XPU Installation](kunlunxin_xpu.md)
diff --git a/docs/zh/get_started/installation/iluvatar_gpu.md b/docs/zh/get_started/installation/iluvatar_gpu.md
index 79d9a7779a..c01ffe93a5 100644
--- a/docs/zh/get_started/installation/iluvatar_gpu.md
+++ b/docs/zh/get_started/installation/iluvatar_gpu.md
@@ -1,8 +1,9 @@
# 如何在天数机器上运行 ERNIE-4.5-300B-A47B-BF16 & ERNIE-4.5-21B-A3B
-该软件的当前版本仅作为Iluvatar CoreX与大型模型的Fastdeploy推理框架相结合的演示。在GSM8K数据集上运行最新的ERNIE4.5 300B模型大约需要6.3小时。
+当前版本软件仅作为天数芯片 + FastDeploy 推理大模型的演示 demo,运行最新 ERNIE4.5 模型可能存在问题,后续会进行修复和性能优化,给客户提供一个更稳定的版本。
## 准备机器
首先您需要准备以下配置的机器
+
| CPU | 内存 | 天数 | 硬盘|
|-----|------|-----|-----|
| x86 | 1TB| 8xBI150| 1TB|
@@ -17,7 +18,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
```
## 准备容器
-### 启动容器
+1. 启动容器
```bash
docker run -itd --name paddle_infer -v /usr/src:/usr/src -v /lib/modules:/lib/modules -v /dev:/dev -v /home/paddle:/home/paddle --privileged --cap-add=ALL --pid=host ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-ixuca:latest
@@ -26,25 +27,12 @@ docker exec -it paddle_infer bash
/home/paddle 为模型文件、whl包、脚本所在目录
-### 安装paddle
-
-```bash
-pip3 install paddlepaddle==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
-pip3 install paddle-iluvatar-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
-```
-获取Paddle的最新安装版本: [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)
+2. 安装whl包
-### 安装fastdeploy
```bash
-pip3 install fastdeploy_iluvatar_gpu==2.1.0.dev0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
-```
-可以按如下步骤编译FastDeploy,,得到```最新版本```.
-```bash
-git clone https://github.com/PaddlePaddle/FastDeploy
-cd FastDeploy
-pip install -r requirements_iluvatar.txt
-export LD_PRELOAD=/usr/local/corex/lib64/libcuda.so.1
-bash build.sh
+pip3 install paddlepaddle==3.1.0a0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
+pip3 install paddle-iluvatar-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/
+pip3 install fastdeploy_iluvatar_gpu -i https://www.paddlepaddle.org.cn/packages/stable/ixuca/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simplels
```
## 准备推理demo脚本
diff --git a/docs/zh/get_started/installation/kunlunxin_xpu.md b/docs/zh/get_started/installation/kunlunxin_xpu.md
index c14f49f5f6..29fb801fc5 100644
--- a/docs/zh/get_started/installation/kunlunxin_xpu.md
+++ b/docs/zh/get_started/installation/kunlunxin_xpu.md
@@ -6,7 +6,7 @@
- Python:3.10
- XPU 型号:P800
- XPU 驱动版本:≥ 5.0.21.26
-- XPU 固件版本:≥ 1.31
+- XPU 固件版本:≥ 1.48
已验证的平台:
- CPU:INTEL(R) XEON(R) PLATINUM 8563C / Hygon C86-4G 7490 64-core Processor
@@ -16,7 +16,7 @@
- Python:3.10
- XPU 型号:P800(OAM 版)
- XPU 驱动版本:5.0.21.26
-- XPU 固件版本:1.31
+- XPU 固件版本:1.48
**注:** 目前只验证过 INTEL 或海光 CPU OAM 版 P800 服务器,暂未验证其它 CPU 和 PCIe 版 P800 服务器。
diff --git a/docs/zh/get_started/installation/nvidia_gpu.md b/docs/zh/get_started/installation/nvidia_gpu.md
index 94c111fe1b..a370a4589a 100644
--- a/docs/zh/get_started/installation/nvidia_gpu.md
+++ b/docs/zh/get_started/installation/nvidia_gpu.md
@@ -15,7 +15,7 @@
**注意**: 如下镜像仅支持SM 80/90架构GPU(A800/H800等),如果你是在L20/L40/4090等SM 86/69架构的GPU上部署,请在创建容器后,卸载```fastdeploy-gpu```再重新安装如下文档指定支持86/89架构的`fastdeploy-gpu`包。
``` shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.0.0
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.0
```
## 2. 预编译Pip安装
@@ -23,7 +23,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12
首先安装 paddlepaddle-gpu,详细安装方式参考 [PaddlePaddle安装](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
``` shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
再安装 fastdeploy,**注意不要通过pypi源安装**,需要通过如下方式安装
@@ -64,7 +64,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
首先安装 paddlepaddle-gpu,详细安装方式参考 [PaddlePaddle安装](https://www.paddlepaddle.org.cn/)
``` shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
接着克隆源代码,编译安装
diff --git a/docs/zh/get_started/quick_start_vl.md b/docs/zh/get_started/quick_start_vl.md
index 0f4c88cc19..b031378acb 100644
--- a/docs/zh/get_started/quick_start_vl.md
+++ b/docs/zh/get_started/quick_start_vl.md
@@ -19,6 +19,7 @@
安装FastDeploy后,在终端执行如下命令,启动服务,其中启动命令配置方式参考[参数说明](../parameters.md)
```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -26,8 +27,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
- --reasoning-parser ernie-45-vl \
- --enable-mm
+ --reasoning-parser ernie-45-vl
```
>💡 注意:在 ```--model``` 指定的路径中,若当前目录下不存在该路径对应的子目录,则会尝试根据指定的模型名称(如 ```baidu/ERNIE-4.5-0.3B-Base-Paddle```)查询AIStudio是否存在预置模型,若存在,则自动启动下载。默认的下载路径为:```~/xx```。关于模型自动下载的说明和配置参阅[模型下载](../supported_models.md)。
diff --git a/docs/zh/index.md b/docs/zh/index.md
index 312b3aed97..73bf10fa96 100644
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -13,12 +13,12 @@
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅(WINT4)| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|✅(WINT4)| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K |
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | WIP | ✅|128K |
+|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅|128K |
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅| 128K |
## 文档说明
diff --git a/docs/zh/offline_inference.md b/docs/zh/offline_inference.md
index 7dc8e195e0..a773114957 100644
--- a/docs/zh/offline_inference.md
+++ b/docs/zh/offline_inference.md
@@ -39,7 +39,7 @@ for output in outputs:
```python
from fastdeploy.entrypoints.llm import LLM
# 加载模型
-llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle", tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.chat(
messages=[
@@ -127,7 +127,7 @@ for message in messages:
})
sampling_params = SamplingParams(temperature=0.1, max_tokens=6400)
-llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, enable_mm=True, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
+llm = LLM(model=PATH, tensor_parallel_size=1, max_model_len=32768, limit_mm_per_prompt={"image": 100}, reasoning_parser="ernie-45-vl")
outputs = llm.generate(prompts={
"prompt": prompt,
"multimodal_data": {
diff --git a/docs/zh/parameters.md b/docs/zh/parameters.md
index fbf57a971c..177a2d97bd 100644
--- a/docs/zh/parameters.md
+++ b/docs/zh/parameters.md
@@ -6,6 +6,8 @@
|:-----------------------------------|:----------| :----- |
| ```port``` | `int` | 仅服务化部署需配置,服务HTTP请求端口号,默认8000 |
| ```metrics_port``` | `int` | 仅服务化部署需配置,服务监控Metrics端口号,默认8001 |
+| ```max_waiting_time``` | `int` | 仅服务化部署需配置,服务请求建立连接的最大等待时间,默认-1,表示无等待时间限制 |
+| ```max_concurrency``` | `int` | 仅服务化部署需配置,服务允许的最大并发连接数,默认512 |
| ```engine_worker_queue_port``` | `int` | FastDeploy内部引擎进程通信端口, 默认8002 |
| ```cache_queue_port``` | `int` | FastDeploy内部KVCache进程通信端口, 默认8003 |
| ```max_model_len``` | `int` | 推理默认最大支持上下文长度,默认2048 |
@@ -17,7 +19,7 @@
| ```tokenizer``` | `str` | tokenizer 名或路径,默认为模型路径 |
| ```use_warmup``` | `int` | 是否在启动时进行warmup,会自动生成极限长度数据进行warmup,默认自动计算KV Cache时会使用 |
| ```limit_mm_per_prompt``` | `dict[str]` | 限制每个prompt中多模态数据的数量,如:{"image": 10, "video": 3},默认都为1 |
-| ```enable_mm``` | `bool` | 是否支持多模态数据(仅针对多模模型),默认False |
+| ```enable_mm``` | `bool` | __[已废弃]__ 是否支持多模态数据(仅针对多模模型),默认False |
| ```quantization``` | `str` | 模型量化策略,当在加载BF16 CKPT时,指定wint4或wint8时,支持无损在线4bit/8bit量化 |
| ```gpu_memory_utilization``` | `float` | GPU显存利用率,默认0.9 |
| ```num_gpu_blocks_override``` | `int` | 预分配KVCache块数,此参数可由FastDeploy自动根据显存情况计算,无需用户配置,默认为None |
diff --git a/docs/zh/usage/environment_variables.md b/docs/zh/usage/environment_variables.md
index 8037c33624..cda1fc4f07 100644
--- a/docs/zh/usage/environment_variables.md
+++ b/docs/zh/usage/environment_variables.md
@@ -1,4 +1,5 @@
# FastDeploy 环境变量说明
+
FastDeploy 的环境变量保存在了代码库根目录下 fastdeploy/envs.py 文件中,以下是其对应的中文版说明:
```python
@@ -37,7 +38,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
# 是否使用 HuggingFace 分词器
"FD_USE_HF_TOKENIZER":
- lambda: os.getenv("FD_USE_HF_TOKENIZER", 0),
+ lambda: bool(int(os.getenv("FD_USE_HF_TOKENIZER", 0))),
# 设置 ZMQ 初始化期间接收数据的高水位标记(HWM)
"FD_ZMQ_SNDHWM":
diff --git a/docs/zh/usage/kunlunxin_xpu_deployment.md b/docs/zh/usage/kunlunxin_xpu_deployment.md
index aabfd14925..b894814011 100644
--- a/docs/zh/usage/kunlunxin_xpu_deployment.md
+++ b/docs/zh/usage/kunlunxin_xpu_deployment.md
@@ -3,7 +3,7 @@
|-|-|-|-|-|-|
|ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint8" \
--gpu-memory-utilization 0.9|>=2.0.3|
|ERNIE-4.5-300B-A47B|32K|WINT4|4 (推荐)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
-|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
+|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.95|>=2.0.0|
|ERNIE-4.5-300B-A47B|128K|WINT4|8 (推荐)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \
--port 8188 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
--max-num-seqs 64 \
--quantization "wint4" \
--gpu-memory-utilization 0.9|>=2.0.0|
|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \
--port 8188 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.9|>=2.1.0|
|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡
python -m fastdeploy.entrypoints.openai.api_server \
--model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \
--port 8188 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 128 \
--quantization "wint8" \
--gpu-memory-utilization 0.9|>=2.1.0|
@@ -89,4 +89,4 @@ for chunk in response:
print('\n')
```
-OpenAI 协议的更多说明可参考文档 [OpenAI Chat Compeltion API](https://platform.openai.com/docs/api-reference/chat/create),以及与 OpenAI 协议的区别可以参考 [兼容 OpenAI 协议的服务化部署](../../online_serving/README.md)。
+OpenAI 协议的更多说明可参考文档 [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create),以及与 OpenAI 协议的区别可以参考 [兼容 OpenAI 协议的服务化部署](../online_serving/README.md)。
diff --git a/mkdocs.yml b/mkdocs.yml
index 9ab270d1e9..443659f6d1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,13 +1,103 @@
-site_name: 'FastDeploy 2.0: Large Language Model Deployement'
+site_name: 'FastDeploy : Large Language Model Deployment'
+repo_url: https://github.com/PaddlePaddle/FastDeploy
+repo_name: FastDeploy
+
+theme:
+ name: material
+ highlightjs: true
+ icon:
+ repo: fontawesome/brands/github
+ palette:
+ - media: "(prefers-color-scheme: light)" # 浅色
+ scheme: default
+ primary: indigo
+ accent: indigo
+ toggle:
+ icon: material/brightness-7
+ name: Switch to dark mode
+ - media: "(prefers-color-scheme: dark)" # 深色
+ scheme: slate
+ primary: black
+ accent: indigo
+ toggle:
+ icon: material/brightness-4
+ name: Switch to system preference
+
+plugins:
+ - search
+ - i18n:
+ docs_structure: folder
+ fallback_to_default: true
+ reconfigure_material: true
+ reconfigure_search: true
+ languages:
+ - locale: en
+ default: true
+ name: English
+ site_name: 'FastDeploy: Large Language Model Deployment'
+ build: true
+ - locale: zh
+ name: 简体中文
+ site_name: 飞桨大语言模型推理部署工具包
+ link: /./zh/
+ nav_translations:
+ FastDeploy: FastDeploy
+ Quick Start: 快速入门
+ Installation: 安装
+ Nvidia GPU: 英伟达 GPU
+ KunlunXin XPU: 昆仑芯 XPU
+ HYGON DCU: 海光 DCU
+ Enflame S60: 燧原 S60
+ Iluvatar CoreX: 天数 CoreX
+ Quick Deployment For ERNIE-4.5-0.3B: ERNIE-4.5-0.3B快速部署
+ Quick Deployment for ERNIE-4.5-VL-28B-A3B: ERNIE-4.5-VL-28B-A3B快速部署
+ ERNIE-4.5-300B-A47B: ERNIE-4.5-300B-A47B快速部署
+ ERNIE-4.5-VL-424B-A47B: ERNIE-4.5-VL-424B-A47B快速部署
+ Online Serving: 在线服务
+ OpenAI-Compitable API Server: 兼容 OpenAI 协议的服务化部署
+ Monitor Metrics: 监控Metrics
+ Scheduler: 调度器
+ Offline Inference: 离线推理
+ Best Practices: 最佳实践
+ ERNIE-4.5-0.3B: ERNIE-4.5-0.3B
+ ERNIE-4.5-21B-A3B: ERNIE-4.5-21B-A3B
+ ERNIE-4.5-300B-A47B: ERNIE-4.5-300B-A47B
+ ERNIE-4.5-VL-28B-A3B: ERNIE-4.5-VL-28B-A3B
+ ERNIE-4.5-VL-424B-A47B: ERNIE-4.5-VL-424B-A47B
+ FAQ: 常见问题
+ Quantization: 量化
+ Overview: 概述
+ Online Quantization: 在线量化
+ WINT2 Quantization: WINT2量化
+ Features: 特性
+ Prefix Caching: 前缀缓存
+ Disaggregation: 分离式部署
+ Chunked Prefill: 分块预填充
+ Load Balance: 负载均衡
+ Speculative Decoding: 投机解码
+ Structured Outputs: 结构化输出
+ Reasoning Output: 思考链内容
+ Early Stop: 早停功能
+ Plugins: 插件机制
+ Sampling: 采样策略
+ MultiNode Deployment: 多机部署
+ Supported Models: 支持模型列表
+ Benchmark: 基准测试
+ Usage: 用法
+ Log Description: 日志说明
+ Code Overview: 代码概述
+ Environment Variables: 环境变量
+
nav:
- - 'FastDeploy 2.0': index.md
+ - 'FastDeploy': index.md
- 'Quick Start':
- Installation:
- 'Nvidia GPU': get_started/installation/nvidia_gpu.md
- 'KunlunXin XPU': get_started/installation/kunlunxin_xpu.md
+ - 'HYGON DCU': get_started/installation/hygon_dcu.md
- 'Enflame S60': get_started/installation/Enflame_gcu.md
- 'Iluvatar CoreX': get_started/installation/iluvatar_gpu.md
- - 'Quick Deployment For ERNIE-4.5-0.3B-Paddle': get_started/quick_start.md
+ - 'Quick Deployment For ERNIE-4.5-0.3B': get_started/quick_start.md
- 'Quick Deployment for ERNIE-4.5-VL-28B-A3B': get_started/quick_start_vl.md
- 'ERNIE-4.5-300B-A47B': get_started/ernie-4.5.md
- 'ERNIE-4.5-VL-424B-A47B': get_started/ernie-4.5-vl.md
@@ -16,28 +106,32 @@ nav:
- 'Monitor Metrics': online_serving/metrics.md
- 'Scheduler': online_serving/scheduler.md
- 'Offline Inference': offline_inference.md
- - Quantiation:
+ - Best Practices:
+ - ERNIE-4.5-0.3B: best_practices/ERNIE-4.5-0.3B-Paddle.md
+ - ERNIE-4.5-21B-A3B: best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+ - ERNIE-4.5-300B-A47B: best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+ - ERNIE-4.5-VL-28B-A3B: best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
+ - ERNIE-4.5-VL-424B-A47B: best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
+ - FAQ: best_practices/FAQ.md
+ - Quantization:
- 'Overview': quantization/README.md
- 'Online Quantization': quantization/online_quantization.md
- 'WINT2 Quantization': quantization/wint2.md
- Features:
- 'Prefix Caching': features/prefix_caching.md
- - 'Disaggration': features/disaggregated.md
+ - 'Disaggregation': features/disaggregated.md
- 'Chunked Prefill': features/chunked_prefill.md
- 'Load Balance': features/load_balance.md
- 'Speculative Decoding': features/speculative_decoding.md
- 'Structured Outputs': features/structured_outputs.md
- 'Reasoning Output': features/reasoning_output.md
+ - 'Early Stop': features/early_stop.md
+ - 'Plugins': features/plugins.md
+ - 'Sampling': features/sampling.md
+ - 'MultiNode Deployment': features/multi-node_deployment.md
- 'Supported Models': supported_models.md
- Benchmark: benchmark.md
- Usage:
- 'Log Description': usage/log.md
- 'Code Overview': usage/code_overview.md
- 'Environment Variables': usage/environment_variables.md
-theme:
- name: 'material'
- highlightjs: true
- icon:
- repo: fontawesome/brands/github
-repo_url: https://github.com/PaddlePaddle/FastDeploy
-repo_name: FastDeploy