4 changes: 2 additions & 2 deletions benchmarks/cpp/README.md
@@ -336,15 +336,15 @@ cd cpp/build
`disaggServerBenchmark` only supports `decoder-only` models.
Here is the basic usage:
```bash
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
--generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
```
This command will launch m context engines and n generation engines. You need to ensure `proc` equals the sum of the processes required by each engine plus 1: since `disaggServerBenchmark` runs in orchestrator mode, one additional process is needed for the orchestrator. For example, with two context engines (one TP2_PP1, the other TP1_PP1) and two generation engines (one TP2_PP1, the other TP1_PP1), `proc` should be set to 7.
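
As a quick sanity check, here is a minimal sketch (plain bash, not part of the benchmark tooling; the rank counts are just this example's assumptions) that derives `proc` from the per-engine TP * PP rank counts:

```bash
# Sketch: derive the mpirun process count (proc) for disaggServerBenchmark.
# Each engine needs TP * PP ranks; orchestrator mode adds one extra process.
ctx_ranks=(2 1)   # context engines: TP2_PP1 -> 2 ranks, TP1_PP1 -> 1 rank
gen_ranks=(2 1)   # generation engines: TP2_PP1 -> 2 ranks, TP1_PP1 -> 1 rank

proc=1            # start at 1 to account for the orchestrator process
for r in "${ctx_ranks[@]}" "${gen_ranks[@]}"; do
  proc=$((proc + r))
done
echo "proc=${proc}"   # prints proc=7 for the example above
```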

For example:
```bash
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}

Comment on lines +339 to 349
💡 Verification agent

🧩 Analysis chain

Switch to UCX env var looks correct

TRTLLM_USE_UCX_KVCACHE supersedes the MPI variant in the disaggServerBenchmark examples. Please ensure any internal tooling and docs refer to the new env var.

Quick scan for old variable:


🏁 Script executed:

#!/bin/bash
rg -n --no-heading -e 'TRTLLM_USE_MPI_KVCACHE'

Length of output: 1674


Ensure all references to the old MPI KV-cache env var are updated

I confirmed your switch to TRTLLM_USE_UCX_KVCACHE in benchmarks/cpp/README.md is correct, but there are still numerous references to the deprecated TRTLLM_USE_MPI_KVCACHE elsewhere in the repo. Please update these to use the new UCX-based variable for consistency:

• tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py:44
• tests/integration/defs/triton_server/test.sh:1130, 1145
• tests/integration/defs/triton_server/test_triton_llm.py:3380
• tests/integration/defs/cpp/test_e2e.py:48
• tests/integration/defs/cpp/test_multi_gpu.py:35
• tests/integration/defs/disaggregated/test_disaggregated.py:706
• docker/common/install_mpi4py.sh:34, 47, 50
• examples/disaggregated/README.md:155, 198
• cpp/tensorrt_llm/common/envUtils.cpp:271

Let’s replace TRTLLM_USE_MPI_KVCACHE with TRTLLM_USE_UCX_KVCACHE in these locations (and any other internal tooling/docs) to complete the migration.
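
One possible way to finish the migration is a bulk rename along these lines (a sketch only, assuming GNU `sed`; files such as `cpp/tensorrt_llm/common/envUtils.cpp` or `docker/common/install_mpi4py.sh` may need to keep the old name for backward compatibility, so review `git diff` before committing):

```bash
# Sketch: rename the deprecated env var across the repo, then review the result.
rg -l -e 'TRTLLM_USE_MPI_KVCACHE' \
  | xargs sed -i 's/TRTLLM_USE_MPI_KVCACHE/TRTLLM_USE_UCX_KVCACHE/g'
```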

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

346-346: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

# need 6 GPUs and 7 processes to launch the benchmark.
11 changes: 0 additions & 11 deletions docs/source/advanced/disaggregated-service.md
@@ -66,17 +66,6 @@ A. Yes, it's recommended that different executors use different GPUs. We support

### Debugging FAQs

*Q. How to handle error `Disaggregated serving is not enabled, please check the configuration?`*

A. Please set the `backendType` of `CacheTransceiverConfig`.
```cpp
ExecutorConfig executorConfig{...};

executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
```

When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.

*Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?*

A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
4 changes: 2 additions & 2 deletions examples/cpp/executor/README.md
@@ -124,10 +124,10 @@ From the `examples/cpp/executor/build` folder, you can also run the `executorExa
```
./executorExampleDisaggregated -h
```
Note setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run disaggregated executor.
Note that setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run the disaggregated executor.
For example, you can run:
```
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1

mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv

47 changes: 31 additions & 16 deletions examples/disaggregated/README.md
@@ -12,24 +12,39 @@ cache_transceiver_config:
max_tokens_in_buffer: <int>
```

`backend` specifies the communication backend for transferring the kvCache, valid options include `DEFAULT`,`UCX`, `NIXL`, and `MPI`, the default backend is UCX.
`backend` specifies the communication backend for transferring the KV cache; valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`. The default backend is `UCX`.

`max_tokens_in_buffer` defines the buffer size for kvCache transfers, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
`max_tokens_in_buffer` defines the buffer size for KV cache transfers; for optimal performance, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) across all requests.

You can use multiple `trtllm-serve` commands to launch the context and generation servers that will be used
for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:
You can use multiple `trtllm-serve` commands to launch the context and generation servers required for disaggregated serving. For instance, you might start two context servers and one generation server as shown below.

```bash
# Generate context_extra-llm-api-config.yml
# Overlap scheduler for context servers are disabled because it's not supported for disaggregated context servers yet
echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
Begin by creating `ctx_extra-llm-api-config.yml` and `gen_extra-llm-api-config.yml` following the specified format.

# Start context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
```yaml
# ctx_extra-llm-api-config.yml

# The overlap scheduler for context servers is currently disabled, as it is
# not yet supported in disaggregated context server architectures.
disable_overlap_scheduler: True
cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 2048
```

# Generate gen_extra-llm-api-config.yml
echo -e "cache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > gen_extra-llm-api-config.yml
```yaml
# gen_extra-llm-api-config.yml

cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 2048
```

Then, start the context and generation servers separately.

```bash
# Start context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_1 &

# Start generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
@@ -95,8 +110,8 @@ After this, you can enable the dynamic scaling feature for the use case above as
export TRTLLM_USE_UCX_KVCACHE=1

# Context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &

# Generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --server_role GENERATION --extra_llm_api_options ./gen_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_gen_0 &
@@ -180,4 +195,4 @@ trtllm-serve disaggregated -c disagg_config.yaml

## Known Issues

The MPI communication backend for kvCache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and kvCache transfer.
The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.
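
If the deprecated MPI backend must still be used, the opt-in is a single environment variable; a minimal sketch (assuming the corresponding server config selects `backend: MPI`):

```bash
# Deprecated path: set this only when cache_transceiver_config.backend is MPI,
# so that mpi4py and KV cache transfer do not conflict.
export TRTLLM_USE_MPI_KVCACHE=1
```
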
4 changes: 2 additions & 2 deletions examples/disaggregated/disagg_config.yaml
@@ -11,14 +11,14 @@ context_servers:
kv_cache_config:
free_gpu_memory_fraction: 0.2
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
generation_servers:
num_instances: 1
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8002"
4 changes: 2 additions & 2 deletions examples/disaggregated/slurm/gen_yaml.py
@@ -197,7 +197,7 @@ def gen_config_file(config_path: str,
},
'cache_transceiver_config': {
'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
'backend': 'default',
'backend': 'DEFAULT',
},
},
'generation_servers': {
@@ -225,7 +225,7 @@ def gen_config_file(config_path: str,
},
'cache_transceiver_config': {
'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
'backend': 'default',
'backend': 'DEFAULT',
},
'stream_interval': 20,
}
2 changes: 1 addition & 1 deletion tensorrt_llm/llmapi/llm_args.py
@@ -1039,7 +1039,7 @@ class CacheTransceiverConfig(StrictBaseModel, PybindMirror):
Configuration for the cache transceiver.
"""

backend: Optional[Literal["default", "ucx", "nixl", "mpi"]] = Field(
backend: Optional[Literal["DEFAULT", "UCX", "NIXL", "MPI"]] = Field(
default=None,
description=
"The communication backend type to use for the cache transceiver.")
40 changes: 20 additions & 20 deletions tests/integration/defs/accuracy/test_disaggregated_serving.py
@@ -260,7 +260,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
@@ -269,7 +269,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}

@@ -309,8 +309,8 @@ def test_auto_dtype(self, disable_overlap_scheduler):
gen_server_config = {
"disable_overlap_scheduler": disable_overlap_scheduler
}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
disaggregated_server_config = {
"hostname": "localhost",
"port": 8000,
@@ -351,15 +351,15 @@ def test_ngram(self):
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"speculative_config": speculative_decoding_config,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
disaggregated_server_config = {
Expand Down Expand Up @@ -404,7 +404,7 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model):
"max_num_tokens": 13393 * 2,
"max_batch_size": 1,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
},
"cuda_graph_config": None,
}
@@ -418,7 +418,7 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model):
"max_num_tokens": 13393 * 2,
"max_batch_size": 16,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
},
"cuda_graph_config": None,
}
@@ -472,8 +472,8 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
def test_auto_dtype(self, overlap_scheduler):
ctx_server_config = {"disable_overlap_scheduler": True}
gen_server_config = {"disable_overlap_scheduler": overlap_scheduler}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
# Keep this low to avoid warmup OOM in CI
ctx_server_config["max_seq_len"] = 8192
gen_server_config["max_seq_len"] = 8192
@@ -513,13 +513,13 @@ def test_nixl_backend(self):
ctx_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
disaggregated_server_config = {
@@ -550,8 +550,8 @@ def test_nixl_backend(self):
def test_auto_dtype(self, overlap_scheduler, mtp_nextn):
ctx_server_config = {"disable_overlap_scheduler": True}
gen_server_config = {"disable_overlap_scheduler": not overlap_scheduler}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
if mtp_nextn > 0:
ctx_server_config["speculative_config"] = {
"decoding_type": "MTP",
@@ -597,14 +597,14 @@ def test_auto_dtype(self, overlap_scheduler):
"disable_overlap_scheduler": True,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": overlap_scheduler,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
ctx_server_config["kv_cache_config"] = {
@@ -648,13 +648,13 @@ def test_nixl_backend(self):
ctx_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
ctx_server_config["cache_transceiver_config"]
@@ -686,14 +686,14 @@ def test_auto_dtype(self, overlap_scheduler):
"disable_overlap_scheduler": True,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": overlap_scheduler,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
disaggregated_server_config = {
@@ -21,7 +21,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
- "localhost:8002"
@@ -35,7 +35,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
kv_cache_config:
enable_block_reuse: True
enable_partial_reuse: False
@@ -17,7 +17,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
- "localhost:8002"
@@ -33,7 +33,7 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8003"
- "localhost:8004"
@@ -15,7 +15,7 @@ context_servers:
enable_partial_reuse: True
event_buffer_max_size: 1024
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@@ -30,6 +30,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.05
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"
@@ -15,7 +15,7 @@ context_servers:
enable_partial_reuse: True
event_buffer_max_size: 1024
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@@ -30,6 +30,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.05
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"