diff --git a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md index 333cd38437..5acac0d8f0 100644 --- a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md +++ b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md @@ -76,7 +76,7 @@ Add the following lines to the startup parameters --use-cudagraph ``` Notes: -1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../parameters.md) for related configuration parameter descriptions +1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions 2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time. 3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported. diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md index 6c8fb2d5d2..34d3caaa11 100644 --- a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md +++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md @@ -86,7 +86,7 @@ Add the following lines to the startup parameters --use-cudagraph ``` Notes: -1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../parameters.md) for related configuration parameter descriptions +1. 
Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions 2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time. 3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported. diff --git a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md index 6a8b5af7a1..a95fb1ed2d 100644 --- a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md +++ b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md @@ -135,7 +135,7 @@ Add the following lines to the startup parameters --enable-custom-all-reduce ``` Notes: -1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../parameters.md) for related configuration parameter descriptions +1. Usually, no additional parameters need to be set, but CUDAGraph will generate some additional memory overhead, which may need to be adjusted in some scenarios with limited memory. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md) for related configuration parameter descriptions 2. When CUDAGraph is enabled, if running with multi-GPUs TP>1, `--enable-custom-all-reduce` must be specified at the same time. 3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported. 
diff --git a/docs/features/graph_optimization.md b/docs/features/graph_optimization.md new file mode 100644 index 0000000000..fabb1c709b --- /dev/null +++ b/docs/features/graph_optimization.md @@ -0,0 +1,112 @@ +# Graph optimization technology in FastDeploy + +FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies: ++ **CUDA Graph**: A mechanism that launches multiple GPU operations with a single CPU operation, reducing launch overhead and improving performance + ++ **Dynamic Graph to Static Graph**: Converts dynamic graphs to static graphs, using global graph structure information to optimize the computation graph and improve execution efficiency + ++ **CINN Neural Network Compiler**: On top of the static graph, applies computation graph compilation optimizations such as IR conversion, Kernel fusion, and Kernel generation to achieve comprehensive optimization + +Any dynamic behavior, such as data-dependent control flow, Host-Device synchronization, model inputs with changing addresses/shapes, or dynamic Kernel execution configurations, will cause CUDAGraph Capture/Replay to fail. LLM inference involves dynamic input lengths, dynamic Batch Sizes, flexible Attention implementations, and multi-device communication, which makes CUDAGraph difficult to apply. + +Mainstream open source solutions implement CUDA Graph on top of static graphs, which requires a deep technology stack. FastDeploy not only supports combining static graphs, the neural network compiler, and CUDAGraph, but also supports applying CUDAGraph directly on dynamic graphs; the latter has lower development costs, but the dynamic behavior it must handle is more complex. + +FastDeploy's `GraphOptimizationBackend` design architecture is shown below. **Some functions are still under development, so please read the usage restrictions in Chapter 1 carefully.** + +![](./images/GraphOptBackendArch.svg) + +## 1.
GraphOptimizationBackend current usage restrictions +### 1.1 Custom all-reduce needs to be enabled in multi-device scenarios +In CUDAGraph multi-device inference tasks, the Custom all-reduce operator must be used to perform multi-card all-reduce. + +Before version 2.2, neither CUDAGraph nor the Custom all-reduce operator was enabled by default. You need to add `--enable-custom-all-reduce` to the startup command to enable them manually. + +### 1.2 Dynamic Kernel execution configurations related to FLAGS_max_partition_size cause CUDAGraph execution to fail +The `FLAGS_max_partition_size` environment variable controls the `gridDim` execution configuration of the Kernel in CascadeAppend Attention, and a dynamic execution configuration causes CUDAGraph execution to fail. +[PR#3223](https://github.com/PaddlePaddle/FastDeploy/pull/3223) fixed this issue, but it still exists in Release versions before 2.2. + +**Problem self-check method:** ++ Compute `div_up(max_model_len, max_partition_size)` from the value of `FLAGS_max_partition_size` (default: 32K) and the `max_model_len` startup parameter. If the result is greater than `1`, execution fails; if it equals `1`, it runs normally. + +**Solutions:** +1. Adjust the values of `FLAGS_max_partition_size` and `max_model_len` so that the dynamic execution configuration is not triggered. +2. Disable CUDAGraph. + +## 2. GraphOptimizationBackend related configuration parameters +Currently, only user configuration of the following parameters is supported: ++ `use_cudagraph` : bool = False ++ `graph_optimization_config` : Dict[str, Any] + + `graph_opt_level`: int = 0 + + `use_cudagraph`: bool = False + + `cudagraph_capture_sizes` : List[int] = None + +CudaGraph can be enabled by setting `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`. Setting `use_cudagraph` through both methods at the same time may cause conflicts.
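As a concrete sketch, the JSON form below is equivalent to passing `--use-cudagraph`, with the remaining keys spelled out at the default values listed above (adjust them as needed for your deployment):

```
--graph-optimization-config '{"use_cudagraph": true, "graph_opt_level": 0, "cudagraph_capture_sizes": null}'
```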
+ +The `graph_opt_level` parameter within `--graph-optimization-config` is used to configure the graph optimization level, with the following options: ++ `0`: Use the dynamic compute graph (default) ++ `1`: Use the static compute graph; during the initialization phase, the Paddle API is used to convert the dynamic graph into a static graph ++ `2`: On top of the static compute graph, use Paddle's compiler (CINN, Compiler Infrastructure for Neural Networks) to compile and optimize + +In general, static graphs have lower Kernel Launch overhead than dynamic graphs, so static graphs are recommended. +For adapted models, FastDeploy's CudaGraph *supports both dynamic and static graphs*. + +When CudaGraph is enabled with the default configuration, the list of Batch Sizes that CudaGraph needs to capture is automatically set based on the `max_num_seqs` parameter. The logic for generating the list of Batch Sizes to capture is as follows: + +1. Generate a candidate list of Batch Sizes in the range [1, 1024]. + +``` + # Batch Size [1, 2, 4, 8, 16, ... 120, 128] + candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)] + # Batch Size (128, 144, ... 240, 256] + candidate_capture_sizes += [16 * i for i in range(9, 17)] + # Batch Size (256, 288, ... 992, 1024] + candidate_capture_sizes += [32 * i for i in range(9, 33)] +``` + +2. Crop the candidate list based on the user-set `max_num_seqs` to obtain a CudaGraph capture list in the range [1, `max_num_seqs`].
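The two generation steps above can be sketched as a small standalone helper. `generate_capture_sizes` is a hypothetical name for illustration, not FastDeploy's actual API, and it follows the commented Batch Size ranges shown above:

```python
def generate_capture_sizes(max_num_seqs: int) -> list[int]:
    """Sketch of the capture-list derivation described above."""
    # Step 1: candidate Batch Sizes covering [1, 1024]
    candidates = [1, 2, 4] + [8 * i for i in range(1, 17)]  # 1 .. 128
    candidates += [16 * i for i in range(9, 17)]            # 144 .. 256
    candidates += [32 * i for i in range(9, 33)]            # 288 .. 1024
    # Step 2: crop to the user-configured maximum concurrency
    return [bs for bs in candidates if bs <= max_num_seqs]

print(generate_capture_sizes(256))
```

With `max_num_seqs=256` this yields 27 capture sizes (1, 2, 4, 8, ..., 128, 144, ..., 256), which shows why lowering `max_num_seqs` directly reduces the number of graphs captured.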
+ +Users can also customize the Batch Size list that CudaGraph needs to capture through the `cudagraph_capture_sizes` parameter in `--graph-optimization-config`: + +``` +--graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}' +``` + +### 2.1 CudaGraph related parameters + +Using CudaGraph incurs some additional memory overhead, divided into two categories in FastDeploy: ++ Additional input Buffer overhead ++ CudaGraph uses a dedicated memory pool and therefore holds some intermediate activation memory that is isolated from the main framework + +During initialization, FastDeploy first uses the `gpu_memory_utilization` parameter to calculate the memory available for `KVCache`; after `KVCache` is initialized, the remaining memory is used to initialize CudaGraph. Since CudaGraph is not enabled by default yet, using the default startup parameters may lead to `Out of memory` errors. The following solutions can be tried: ++ Lower the `gpu_memory_utilization` value to reserve more memory for CudaGraph. ++ Lower `max_num_seqs` to decrease the maximum concurrency. ++ Customize the Batch Size list that CudaGraph needs to capture through `graph_optimization_config`, reducing the number of captured graphs via `cudagraph_capture_sizes`. + +Before use, make sure the loaded model is properly decorated with ```@support_graph_optimization```. + + ```python + # 1. import decorator + from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization + ... + + # 2. add decorator + @support_graph_optimization + class Ernie4_5_Model(nn.Layer): # Note: the decorator is added to the nn.Layer subclass + ... + + # 3. modify parameter passing in the ModelForCasualLM subclass's self.model() call + class Ernie4_5_MoeForCausalLM(ModelForCasualLM): + ... + def forward( + self, + ids_remove_padding: paddle.Tensor, + forward_meta: ForwardMeta, + ): + hidden_states = self.model(ids_remove_padding=ids_remove_padding, # specify the parameter name when passing + forward_meta=forward_meta) + return hidden_states + ``` diff --git a/docs/features/images/GraphOptBackendArch.svg b/docs/features/images/GraphOptBackendArch.svg new file mode 100644 index 0000000000..4a599bd024 --- /dev/null +++ b/docs/features/images/GraphOptBackendArch.svg @@ -0,0 +1 @@ +
[SVG diagram source omitted: GraphOptimizationBackend architecture — Dynamic Graph → Dynamic to Static → Static Graph; Split Graph → Static/Dynamic SubGraphs and Attention Layers (Modular Networking); Graph Capture Stage and Graph Replay Stage (CUDA Graph, full-graph/subgraph replay, Attention.forward(), padded inputs, Kernel parameter updates); CINN]
diff --git a/docs/parameters.md b/docs/parameters.md index 28a66b72c9..f302fbe420 100644 --- a/docs/parameters.md +++ b/docs/parameters.md @@ -35,8 +35,8 @@ When using FastDeploy to deploy models (including offline inference and service | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 | | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 | | ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output | -| ```use_cudagraph``` | `bool` | Whether to use cuda graph, default: False | -|```graph_optimization_config``` | `str` | Parameters related to graph optimization can be configured, with default values of'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' | +| ```use_cudagraph``` | `bool` | Whether to use cuda graph, default: False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. In multi-GPU scenarios, Custom all-reduce must be enabled at the same time. 
| +| ```graph_optimization_config``` | `dict[str]` | Parameters related to computation graph optimization can be configured; default value: '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }'. See [graph_optimization.md](./features/graph_optimization.md) for a detailed description. | | ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False | | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] | | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None | @@ -70,86 +70,3 @@ In actual inference, it's difficult for users to know how to properly configure When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the batch token count in prefill phase (limiting single prefill token count), thus introducing `max_num_partial_prefills` parameter specifically to limit concurrently processed partial batches. To optimize scheduling priority for short requests, new `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in single prefill batch, the latter defines the token threshold for long requests. The system will prioritize batch space for short requests, thereby reducing short request latency in mixed workload scenarios while maintaining stable throughput. - -## 4.
GraphOptimizationBackend related configuration parameters -Currently, only user configuration of the following parameters is supported: -- `use_cudagraph` : bool = False -- `graph_optimization_config` : Dict[str, Any] - - `graph_opt_level`: int = 0 - - `use_cudagraph`: bool = False - - `cudagraph_capture_sizes` : List[int] = None - -CudaGrpah can be enabled by setting `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`. Using two different methods to set the use graph simultaneously may cause conflicts. - -The `graph_opt_level` parameter within `--graph-optimization-config` is used to configure the graph optimization level, with the following available options: -- `0`: Use Dynamic compute graph, default to 0 -- `1`: Use Static compute graph, during the initialization phase, Paddle API will be used to convert the dynamic image into a static image -- `2`: Base on Static compute graph, use the complier(CINN, Compiler Infrastructure for Neural Networks) of Paddle to compile and optimize - -In general, static graphs have lower Kernel Launch overhead than dynamic graphs, and it is recommended to use static graphs. -For adapted models, FastDeploy's CudaGraph *can support both dynamic and static graphs* simultaneously. - -When CudaGraph is enabled in the default configuration, a list of Batch Sizes that CudaGraph needs to capture will be automatically set based on the 'max_num_deqs' parameter. The logic for generating the list of Batch Sizes that need to be captured is as follows: - -1. Generate a candidate list with a range of [1,1024] Batch Size. - -``` - # Batch Size [1, 2, 4, 8, 16, ... 120, 128] - candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)] - # Batch Size (128, 144, ... 240, 256] - candidate_capture_sizes += [16 * i for i in range(9, 17)] - # Batch Size (256, 288, ... 992, 1024] - candidate_capture_sizes += [32 * i for i in range(17, 33)] -``` - -2. 
Crop the candidate list based on the user set 'max_num_deqs' to obtain a CudaGraph capture list with a range of [1,' max_num_deqs']. - -Users can also customize the batch size list that needs to be captured by CudaGraph through the parameter `cudagraph_capture_sizes` in`--graph-optimization-config`: - -``` ---graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}' -``` - -### CudaGraph related parameters - - Using CudaGraph incurs some additional memory overhead, divided into two categories in FastDeploy: -- Additional input Buffer overhead -- CudaGraph uses dedicated memory pool, thus holding some intermediate activation memory isolated from main framework - -FastDeploy initialization sequence first uses `gpu_memory_utilization` parameter to calculate available memory for `KVCache`, after initializing `KVCache` then uses remaining memory to initialize CudaGraph. Since CudaGraph is not enabled by default currently, using default startup parameters may encounter `Out of memory` errors, can try following solutions: -- Lower `gpu_memory_utilization` value, reserve more memory for CudaGraph. -- Lower `max_num_seqs` to decrease the maximum concurrency. -- Customize the batch size list that CudaGraph needs to capture through `graph_optimization_config`, and reduce the number of captured graphs by using `cudagraph_capture_sizes` - -- Before use, must ensure loaded model is properly decorated with ```@support_graph_optimization```. - - ```python - # 1. import decorator - from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization - ... - - # 2. add decorator - @support_graph_optimization - class Ernie4_5_Model(nn.Layer): # Note decorator is added to nn.Layer subclass - ... - - # 3. modify parameter passing in ModelForCasualLM subclass's self.model() - class Ernie4_5_MoeForCausalLM(ModelForCasualLM): - ... 
- def forward( - self, - ids_remove_padding: paddle.Tensor, - forward_meta: ForwardMeta, - ): - hidden_states = self.model(ids_remove_padding=ids_remove_padding, # specify parameter name when passing - forward_meta=forward_meta) - return hidden_statesfrom fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization - ... - - @support_graph_optimization - class Ernie45TModel(nn.Layer): # Note decorator is added to nn.Layer subclass - ... - ``` - -- When ```use_cudagraph``` is enabled, currently only supports single-GPU inference, i.e. ```tensor_parallel_size``` set to 1. -- When ```use_cudagraph``` is enabled, cannot enable ```enable_prefix_caching``` or ```enable_chunked_prefill```. diff --git a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md index 4b9eb3343a..761ec3a303 100644 --- a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md +++ b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md @@ -76,7 +76,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操 --use-cudagraph ``` 注: -1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../parameters.md) 相关配置参数说明 +1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明 2. 开启CUDAGraph时,如果是TP>1的多卡推理场景,需要同时指定 `--enable-custom-all-reduce` 3. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。 diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md index db4985cc75..efe6f3cba7 100644 --- a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md +++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md @@ -86,7 +86,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操 --use-cudagraph ``` 注: -1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../parameters.md) 相关配置参数说明 +1. 
通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明 2. 开启CUDAGraph时,如果是TP>1的多卡推理场景,需要同时指定 `--enable-custom-all-reduce` 3. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。 diff --git a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md index b6ef33a19a..cbe4ae7279 100644 --- a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md +++ b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md @@ -136,7 +136,7 @@ CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术,通过将 CUDA 操 --enable-custom-all-reduce ``` 注: -1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../parameters.md) 相关配置参数说明 +1. 通常情况下不需要额外设置其他参数,但CUDAGraph会产生一些额外的显存开销,在一些显存受限的场景下可能需要调整。详细的参数调整请参考[GraphOptimizationBackend](../features/graph_optimization.md) 相关配置参数说明 2. 开启CUDAGraph时,如果是TP>1的多卡推理场景,需要同时指定 `--enable-custom-all-reduce` 3. 开启CUDAGraph时,暂时不支持`max-model-len > 32768`的场景。 diff --git a/docs/zh/features/graph_optimization.md b/docs/zh/features/graph_optimization.md new file mode 100644 index 0000000000..0b6ca21d7c --- /dev/null +++ b/docs/zh/features/graph_optimization.md @@ -0,0 +1,119 @@ +# FastDeploy 中的图优化技术 +FastDeploy 的 `GraphOptimizationBackend` 中集成了多种图优化技术: + ++ **CUDA Graph**:一种通过单个 CPU 操作启动多个 GPU 操作的机制,可以降低开销并提高性能 + ++ **动态图转静态图**:将动态图转换为静态图,利用全局图结构信息优化计算图、提升执行效率 + ++ **CINN 神经网络编译器**:在静态图的基础上执行 IR 转换、Kernel 融合、Kernel 生成等计算图编译优化方法,实现综合优化 + +任何依赖数据的控制流、Host-Device 同步、地址/形状变化的模型输入、动态的 Kernel 执行配置等动态情况都会导致 CUDAGraph Capture/Replay 失败,而大模型推理中面临场景的是动态的输入长度、动态的 Batch Size,灵活的 Attention 实现和多卡通信,导致 CUDA Graph 难以应用。 + +开源主流方案基于静态图实现 CUDA Graph,技术栈较深。FastDeploy 不仅支持静态图、神经网络编译器、CUDAGraph 组合优化,还支持直接在动态图中应用 CUDA Graph ,开发成本更低,但面临的动态情况更复杂。 + +FastDeploy 的 `GraphOptimizationBackend` 设计架构如下,**部分功能仍在开发中,建议仔细阅读第一章节使用限制**。 + +![](./images/GraphOptBackendArch.svg) + +## 1. 
GraphOptimizationBackend 当前使用限制 +### 1.1 多卡场景需要开启 Custom all-reduce +在 CUDAGraph 多卡推理任务中需要使用 Custom all-reduce 算子进行多卡 all-reduce。 + +在 2.2 版本之前,CUDAGraph 和 Custom all-reduce 算子都未默认开启,需要在启动命令中添加 `--enable-custom-all-reduce` 手动开启。 + +### 1.2 FLAGS_max_partition_size 相关的 Kernel 的动态执行配置导致 CUDAGraph 执行失败 +`FLAGS_max_partition_size` 环境变量控制了 CascadeAppend Attention 中 Kernel 的 `gridDim` 执行配置,而动态的执行配置会导致 CUDAGraph 执行失败。 + +[PR#3223](https://github.com/PaddlePaddle/FastDeploy/pull/3223) 修复了这个问题,但在 2.2 之前的 Release 版本依然存在这个问题。 + +**问题自查方法:** ++ 根据`FLAGS_max_partition_size`的值(默认是 32K)和启动参数中的 `max_model_len`计算`div_up(max_model_len, max_partition_size)`,结果大于`1`时无法执行,等于`1`时可以正常运行 + +**解决方法:** + 1. 调整`FLAGS_max_partition_size`和`max_model_len`的值,不触发动态执行配置。 + 2. 关闭 CUDAGraph + +## 2. GraphOptimizationBackend 相关配置参数说明 + +当前仅支持用户配置以下参数: + ++ `use_cudagraph` : bool = False ++ `graph_optimization_config` : Dict[str, Any] + + `graph_opt_level`: int = 0 + + `use_cudagraph`: bool = False + + `cudagraph_capture_sizes` : List[int] = None + +可以通过设置 `--use-cudagraph` 或 `--graph-optimization-config '{"use_cudagraph":true}'` 开启 CudaGraph。 + +`--graph-optimization-config` 中的 `graph_opt_level` 参数用于配置图优化等级,可选项如下: + ++ `0`: 动态图,默认为 0 ++ `1`: 静态图,初始化阶段会使用 Paddle API 将动态图转换为静态图 ++ `2`: 在静态图的基础上,使用 Paddle 框架编译器(CINN, Compiler Infrastructure for Neural Networks)进行编译优化 + +一般情况下静态图比动态图的 Kernel Launch 开销更小,推荐使用静态图。 +对于已适配的模型,FastDeploy 的 CudaGraph **可同时支持动态图与静态图**。 + +在默认配置下开启 CudaGraph 时,会根据 `max_num_seqs` 参数自动设置 CudaGraph 需要捕获的 Batch Size 列表,需要捕获的 Batch Size 的列表自动生成逻辑如下: + +1. 生成一个范围为 [1,1024] Batch Size 的候选列表 + +``` + # Batch Size [1, 2, 4, 8, 16, ... 120, 128] + candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)] + # Batch Size (128, 144, ... 240, 256] + candidate_capture_sizes += [16 * i for i in range(9, 17)] + # Batch Size (256, 288, ... 992, 1024] + candidate_capture_sizes += [32 * i for i in range(9, 33)] +``` + +2.
根据用户设置的 `max_num_seqs` 裁剪候选列表,得到范围为 [1, `max_num_seqs`] 的 CudaGraph 捕获列表。 + +用户也可以通过 `--graph-optimization-config` 中的 `cudagraph_capture_sizes` 参数自定义需要被 CudaGraph 捕获的 Batch Size 列表: + +``` +--graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}' +``` + +### 2.1 CudaGraph相关参数说明 + +使用 CudaGraph 会产生一些额外的显存开销,在FastDeploy中分为下面两类: + ++ 额外的输入 Buffer 开销 ++ CudaGraph 使用了专用的显存池,因此会持有一部分与主框架隔离的中间激活显存 + +FastDeploy 的初始化顺序为先使用 `gpu_memory_utilization` 参数计算 `KVCache` 可用的显存,初始化完 `KVCache` 之后才会使用剩余显存初始化 CudaGraph。由于 CudaGraph 目前还不是默认开启的,因此使用默认启动参数可能会遇到 `Out Of Memory` 错误,可以尝试使用下面三种方式解决: + ++ 调低 `gpu_memory_utilization` 的值,多预留一些显存给CudaGraph使用。 ++ 调低 `max_num_seqs` 的值,降低最大并发数。 ++ 通过 `graph_optimization_config` 自定义需要 CudaGraph 捕获的 Batch Size 列表 `cudagraph_capture_sizes`,减少捕获的图的数量 + +使用CudaGraph之前,需要确保加载的模型被装饰器 ```@support_graph_optimization``` 正确修饰。 + +```python + # 1. import 装饰器 + from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization + ... + + # 2. 添加装饰器 + @support_graph_optimization + class Ernie4_5_Model(nn.Layer): # 注意 decorator 加在 nn.Layer 的子类上 + ... + + # 3. 修改 ModelForCasualLM 子类中 self.model() 的传参方式 + class Ernie4_5_MoeForCausalLM(ModelForCasualLM): + ... + def forward( + self, + ids_remove_padding: paddle.Tensor, + forward_meta: ForwardMeta, + ): + hidden_states = self.model(ids_remove_padding=ids_remove_padding, # 传参时指定参数名 + forward_meta=forward_meta) + return hidden_states +``` diff --git a/docs/zh/features/images/GraphOptBackendArch.svg b/docs/zh/features/images/GraphOptBackendArch.svg new file mode 100644 index 0000000000..4a599bd024 --- /dev/null +++ b/docs/zh/features/images/GraphOptBackendArch.svg @@ -0,0 +1 @@ +
[SVG diagram source omitted: GraphOptimizationBackend architecture — Dynamic Graph → Dynamic to Static → Static Graph; Split Graph → Static/Dynamic SubGraphs and Attention Layers (Modular Networking); Graph Capture Stage and Graph Replay Stage (CUDA Graph, full-graph/subgraph replay, Attention.forward(), padded inputs, Kernel parameter updates); CINN]
diff --git a/docs/zh/parameters.md b/docs/zh/parameters.md index 177a2d97bd..09ba05d603 100644 --- a/docs/zh/parameters.md +++ b/docs/zh/parameters.md @@ -33,8 +33,8 @@ | ```long_prefill_token_threshold``` | `int` | 开启Chunked Prefill时,请求Token数超过此值的请求被视为长请求,默认为max_model_len*0.04 | | ```static_decode_blocks``` | `int` | 推理过程中,每条请求强制从Prefill的KVCache分配对应块数给Decode使用,默认2| | ```reasoning_parser``` | `str` | 指定要使用的推理解析器,以便从模型输出中提取推理内容 | -| ```use_cudagraph``` | `bool` | 是否使用cuda graph,默认False | -|```graph_optimization_config``` | `str` | 可以配置计算图优化相关的参数,默认值为'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' | +| ```use_cudagraph``` | `bool` | 是否使用cuda graph,默认False。开启前建议仔细阅读 [graph_optimization.md](./features/graph_optimization.md),在多卡场景需要同时开启 Custom all-reduce。 | +| ```graph_optimization_config``` | `dict[str]` | 可以配置计算图优化相关的参数,默认值为'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }',详细说明参考 [graph_optimization.md](./features/graph_optimization.md)| | ```enable_custom_all_reduce``` | `bool` | 开启Custom all-reduce,默认False | | ```splitwise_role``` | `str` | 是否开启splitwise推理,默认值mixed, 支持参数为["mixed", "decode", "prefill"] | | ```innode_prefill_ports``` | `str` | prefill 实例内部引擎启动端口 (仅单机PD分离需要),默认值None | @@ -67,84 +67,3 @@ FastDeploy在推理过程中,显存被```模型权重```、```预分配KVCache 当启用 `enable_chunked_prefill` 时,服务通过动态分块处理长输入序列,显著提升GPU资源利用率。在此模式下,原有 `max_num_batched_tokens` 参数不再约束预填充阶段的批处理token数量(限制单次prefill的token数量),因此引入 `max_num_partial_prefills` 参数,专门用于限制同时处理的分块批次数。 为优化短请求的调度优先级,新增 `max_long_partial_prefills` 与 `long_prefill_token_threshold` 参数组合。前者限制单个预填充批次中的长请求数量,后者定义长请求的token阈值。系统会优先保障短请求的批处理空间,从而在混合负载场景下降低短请求延迟,同时保持整体吞吐稳定。 - -## 4. 
GraphOptimizationBackend 相关配置参数说明 -当前仅支持用户配置以下参数: -- `use_cudagraph` : bool = False -- `graph_optimization_config` : Dict[str, Any] - - `graph_opt_level`: int = 0 - - `use_cudagraph`: bool = False - - `cudagraph_capture_sizes` : List[int] = None - -可以通过设置 `--use-cudagraph` 或 `--graph-optimization-config '{"use_cudagraph":true}'` 开启 CudaGrpah。 - -`--graph-optimization-config` 中的 `graph_opt_level` 参数用于配置图优化等级,可选项如下: -- `0`: 动态图,默认为 0 -- `1`: 静态图,初始化阶段会使用 Paddle API 将动态图转换为静态图 -- `2`: 在静态图的基础上,使用 Paddle 框架编译器(CINN, Compiler Infrastructure for Neural Networks)进行编译优化 - -一般情况下静态图比动态图的 Kernel Launch 开销更小,推荐使用静态图。 -对于已适配的模型,FastDeploy 的 CudaGraph **可同时支持动态图与静态图**。 - -在默认配置下开启 CudaGraph 时,会根据 `max_num_seqs` 参数自动设置 CudaGraph 需要捕获的 Batch Size 列表,需要捕获的 Batch Size 的列表自动生成逻辑如下: -1. 生成一个范围为 [1,1024] Batch Size 的候选列表 - -``` - # Batch Size [1, 2, 4, 8, 16, ... 120, 128] - candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)] - # Batch Size (128, 144, ... 240, 256] - candidate_capture_sizes += [16 * i for i in range(9, 17)] - # Batch Size (256, 288, ... 992, 1024] - candidate_capture_sizes += [32 * i for i in range(17, 33)] -``` - -2. 
根据用户设置的 `max_num_seqs` 裁剪候选列表,得到范围为 [1, `max_num_seqs`] 的 CudaGraph 捕获列表。 - -用户也可以通过 `--graph-optimization-config` 中的 `cudagraph_capture_sizes` 参数自定义需要被 CudaGraph 捕获的 Batch Size 列表: - -``` ---graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}' -``` - -### CudaGraph相关参数说明 -使用 CudaGraph 会产生一些额外的显存开销,在FastDeploy中分为下面两类: -- 额外的输入 Buffer 开销 -- CudaGraph 使用了专用的显存池,因此会持有一部分与主框架隔离的中间激活显存 - -FastDeploy 的初始化顺序为先使用 `gpu_memory_utilization` 参数计算 `KVCache` 可用的显存,初始化完 `KVCache` 之后才会使用剩余显存初始化 CudaGraph。由于 CudaGraph 目前还不是默认开启的,因此使用默认启动参数可能会遇到 `Out Of Memory` 错误,可以尝试使用下面三种方式解决: -- 调低 `gpu_memory_utilization` 的值,多预留一些显存给CudaGraph使用。 -- 调低 `max_num_seqs` 的值,降低最大并发数。 -- 通过 `graph_optimization_config` 自定义需要 CudaGraph 捕获的 Batch Size 列表 `cudagraph_capture_sizes`,减少捕获的图的数量 - -使用CudaGraph之前,需要确保加载的模型被装饰器 ```@support_graph_optimization```正确修饰。 - - ```python - # 1. import 装饰器 - from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization - ... - - # 2. 添加装饰器 - @support_graph_optimization - class Ernie4_5_Model(nn.Layer): # 注意 decorator 加在 nn.Layer 的子类上 - ... - - # 3. 修改 ModelForCasualLM 子类中 self.model() 的传参方式 - class Ernie4_5_MoeForCausalLM(ModelForCasualLM): - ... - def forward( - self, - ids_remove_padding: paddle.Tensor, - forward_meta: ForwardMeta, - ): - hidden_states = self.model(ids_remove_padding=ids_remove_padding, # 传参时指定参数名 - forward_meta=forward_meta) - return hidden_statesfrom fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization - ... - - @support_graph_optimization - class Ernie45TModel(nn.Layer): # 注意 decorator 加在 nn.Layer 的子类上 - ... 
- ``` - -- 当开启 ```use_cudagraph``` 时,暂时只支持单卡推理,即 ```tensor_parallel_size``` 设为1。 -- 当开启 ```use_cudagraph``` 时,暂不支持开启 ```enable_prefix_caching``` 或 ```enable_chunked_prefill``` 。 diff --git a/mkdocs.yml b/mkdocs.yml index 443659f6d1..297e8ec97b 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -36,10 +36,11 @@ plugins: name: English site_name: 'FastDeploy: Large Language Model Deployement' build: true + link: /FastDeploy/ - locale: zh name: 简体中文 site_name: 飞桨大语言模型推理部署工具包 - link: /./zh/ + link: /FastDeploy/zh/ nav_translations: FastDeploy: FastDeploy Quick Start: 快速入门 @@ -81,6 +82,7 @@ plugins: Plugins: 插件机制 Sampling: 采样策略 MultiNode Deployment: 多机部署 + Graph Optimization: 图优化 Supported Models: 支持模型列表 Benchmark: 基准测试 Usage: 用法 @@ -129,6 +131,7 @@ nav: - 'Plugins': features/plugins.md - 'Sampling': features/sampling.md - 'MultiNode Deployment': features/multi-node_deployment.md + - 'Graph Optimization': features/graph_optimization.md - 'Supported Models': supported_models.md - Benchmark: benchmark.md - Usage: