4 changes: 2 additions & 2 deletions benchmarks/cpp/README.md
@@ -336,15 +336,15 @@ cd cpp/build
`disaggServerBenchmark` only supports `decoder-only` models.
Here is the basic usage:
```bash
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
--generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
```
This command will launch m context engines and n generation engines. You need to ensure `proc` equals the sum of the processes required by each engine plus 1: since `disaggServerBenchmark` runs in orchestrator mode, one additional process is needed for the orchestrator. For example, with two context engines (one TP2_PP1, the other TP1_PP1) and two generation engines (one TP2_PP1, the other TP1_PP1), `proc` should be set to 7.
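
As a quick sanity check, here is a minimal sketch (plain bash, not part of the benchmark tooling; the rank counts are just this example's assumptions) that derives `proc` from the per-engine TP * PP rank counts:

```bash
# Sketch: derive the mpirun process count (proc) for disaggServerBenchmark.
# Each engine needs TP * PP ranks; orchestrator mode adds one extra process.
ctx_ranks=(2 1)   # context engines: TP2_PP1 -> 2 ranks, TP1_PP1 -> 1 rank
gen_ranks=(2 1)   # generation engines: TP2_PP1 -> 2 ranks, TP1_PP1 -> 1 rank

proc=1            # start at 1 to account for the orchestrator process
for r in "${ctx_ranks[@]}" "${gen_ranks[@]}"; do
  proc=$((proc + r))
done
echo "proc=${proc}"   # prints proc=7 for the example above
```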

For example:
```bash
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1
mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}

Comment on lines +339 to 349
💡 Verification agent

🧩 Analysis chain

Switch to UCX env var looks correct

TRTLLM_USE_UCX_KVCACHE supersedes the MPI variant in the disaggServerBenchmark examples. Please ensure any internal tooling and docs refer to the new env var.

Quick scan for old variable:


🏁 Script executed:

#!/bin/bash
rg -n --no-heading -e 'TRTLLM_USE_MPI_KVCACHE'

Length of output: 1674


Ensure all references to the old MPI KV-cache env var are updated

I confirmed your switch to TRTLLM_USE_UCX_KVCACHE in benchmarks/cpp/README.md is correct, but there are still numerous references to the deprecated TRTLLM_USE_MPI_KVCACHE elsewhere in the repo. Please update these to use the new UCX-based variable for consistency:

• tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py:44
• tests/integration/defs/triton_server/test.sh:1130, 1145
• tests/integration/defs/triton_server/test_triton_llm.py:3380
• tests/integration/defs/cpp/test_e2e.py:48
• tests/integration/defs/cpp/test_multi_gpu.py:35
• tests/integration/defs/disaggregated/test_disaggregated.py:706
• docker/common/install_mpi4py.sh:34, 47, 50
• examples/disaggregated/README.md:155, 198
• cpp/tensorrt_llm/common/envUtils.cpp:271

Let’s replace TRTLLM_USE_MPI_KVCACHE with TRTLLM_USE_UCX_KVCACHE in these locations (and any other internal tooling/docs) to complete the migration.
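
One possible way to finish the migration is a bulk rename along these lines (a sketch only, assuming GNU `sed`; files such as `cpp/tensorrt_llm/common/envUtils.cpp` or `docker/common/install_mpi4py.sh` may need to keep the old name for backward compatibility, so review `git diff` before committing):

```bash
# Sketch: rename the deprecated env var across the repo, then review the result.
rg -l -e 'TRTLLM_USE_MPI_KVCACHE' \
  | xargs sed -i 's/TRTLLM_USE_MPI_KVCACHE/TRTLLM_USE_UCX_KVCACHE/g'
```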

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

346-346: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

# need 6 GPUs and 7 processes to launch the benchmark.
11 changes: 0 additions & 11 deletions docs/source/advanced/disaggregated-service.md
@@ -66,17 +66,6 @@ A. Yes, it's recommended that different executors use different GPUs. We support

### Debugging FAQs

*Q. How to handle error `Disaggregated serving is not enabled, please check the configuration?`*

A. Please set the `backendType` of `CacheTransceiverConfig`.
```cpp
ExecutorConfig executorConfig{...};

executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
```

When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.

*Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?*

A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
4 changes: 2 additions & 2 deletions examples/cpp/executor/README.md
@@ -124,10 +124,10 @@ From the `examples/cpp/executor/build` folder, you can also run the `executorExa
```
./executorExampleDisaggregated -h
```
Note setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run disaggregated executor.
Note that setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run the disaggregated executor.
For example, you can run:
```
export TRTLLM_USE_MPI_KVCACHE=1
export TRTLLM_USE_UCX_KVCACHE=1

mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv

47 changes: 31 additions & 16 deletions examples/disaggregated/README.md
@@ -12,24 +12,39 @@ cache_transceiver_config:
max_tokens_in_buffer: <int>
```

`backend` specifies the communication backend for transferring the kvCache, valid options include `DEFAULT`,`UCX`, `NIXL`, and `MPI`, the default backend is UCX.
`backend` specifies the communication backend for transferring the KV cache; valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`. The default backend is `UCX`.

`max_tokens_in_buffer` defines the buffer size for kvCache transfers, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
`max_tokens_in_buffer` defines the buffer size for KV cache transfers; for optimal performance, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) across all requests.

You can use multiple `trtllm-serve` commands to launch the context and generation servers that will be used
for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:
You can use multiple `trtllm-serve` commands to launch the context and generation servers required for disaggregated serving. For instance, you might start two context servers and one generation server as shown below.

```bash
# Generate context_extra-llm-api-config.yml
# Overlap scheduler for context servers are disabled because it's not supported for disaggregated context servers yet
echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
Begin by creating `ctx_extra-llm-api-config.yml` and `gen_extra-llm-api-config.yml` following the specified format.

# Start context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
```yaml
# ctx_extra-llm-api-config.yml

# The overlap scheduler for context servers is currently disabled, as it is
# not yet supported in disaggregated context server architectures.
disable_overlap_scheduler: True
cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 2048
```

# Generate gen_extra-llm-api-config.yml
echo -e "cache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > gen_extra-llm-api-config.yml
```yaml
# gen_extra-llm-api-config.yml

cache_transceiver_config:
backend: UCX
max_tokens_in_buffer: 2048
```

Then, start the context and generation servers separately.

```bash
# Start context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_1 &

# Start generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
@@ -95,8 +110,8 @@ After this, you can enable the dynamic scaling feature for the use case above as
export TRTLLM_USE_UCX_KVCACHE=1

# Context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &

# Generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --server_role GENERATION --extra_llm_api_options ./gen_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_gen_0 &
@@ -180,4 +195,4 @@ trtllm-serve disaggregated -c disagg_config.yaml

## Known Issues

The MPI communication backend for kvCache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and kvCache transfer.
The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.
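
If the deprecated MPI backend must still be used, the opt-in is a single environment variable; a minimal sketch (assuming the corresponding server config selects `backend: MPI`):

```bash
# Deprecated path: set this only when cache_transceiver_config.backend is MPI,
# so that mpi4py and KV cache transfer do not conflict.
export TRTLLM_USE_MPI_KVCACHE=1
```
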
4 changes: 2 additions & 2 deletions examples/disaggregated/disagg_config.yaml
@@ -11,14 +11,14 @@ context_servers:
kv_cache_config:
free_gpu_memory_fraction: 0.2
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
generation_servers:
num_instances: 1
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8002"
4 changes: 2 additions & 2 deletions examples/disaggregated/slurm/gen_yaml.py
@@ -197,7 +197,7 @@ def gen_config_file(config_path: str,
},
'cache_transceiver_config': {
'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
'backend': 'default',
'backend': 'DEFAULT',
},
},
'generation_servers': {
@@ -225,7 +225,7 @@ def gen_config_file(config_path: str,
},
'cache_transceiver_config': {
'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
'backend': 'default',
'backend': 'DEFAULT',
},
'stream_interval': 20,
}
2 changes: 1 addition & 1 deletion tensorrt_llm/llmapi/llm_args.py
@@ -1039,7 +1039,7 @@ class CacheTransceiverConfig(StrictBaseModel, PybindMirror):
Configuration for the cache transceiver.
"""

backend: Optional[Literal["default", "ucx", "nixl", "mpi"]] = Field(
backend: Optional[Literal["DEFAULT", "UCX", "NIXL", "MPI"]] = Field(
default=None,
description=
"The communication backend type to use for the cache transceiver.")
40 changes: 20 additions & 20 deletions tests/integration/defs/accuracy/test_disaggregated_serving.py
@@ -260,7 +260,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
@@ -269,7 +269,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}

@@ -309,8 +309,8 @@ def test_auto_dtype(self, disable_overlap_scheduler):
gen_server_config = {
"disable_overlap_scheduler": disable_overlap_scheduler
}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
disaggregated_server_config = {
"hostname": "localhost",
"port": 8000,
@@ -351,15 +351,15 @@ def test_ngram(self):
"disable_overlap_scheduler": True,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"speculative_config": speculative_decoding_config,
"kv_cache_config": kv_cache_config,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
disaggregated_server_config = {
Expand Down Expand Up @@ -404,7 +404,7 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model):
"max_num_tokens": 13393 * 2,
"max_batch_size": 1,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
},
"cuda_graph_config": None,
}
@@ -418,7 +418,7 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model):
"max_num_tokens": 13393 * 2,
"max_batch_size": 16,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
},
"cuda_graph_config": None,
}
@@ -472,8 +472,8 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
def test_auto_dtype(self, overlap_scheduler):
ctx_server_config = {"disable_overlap_scheduler": True}
gen_server_config = {"disable_overlap_scheduler": overlap_scheduler}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
# Keep this low to avoid warmup OOM in CI
ctx_server_config["max_seq_len"] = 8192
gen_server_config["max_seq_len"] = 8192
@@ -513,13 +513,13 @@ def test_nixl_backend(self):
ctx_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
disaggregated_server_config = {
@@ -550,8 +550,8 @@ def test_nixl_backend(self):
def test_auto_dtype(self, overlap_scheduler, mtp_nextn):
ctx_server_config = {"disable_overlap_scheduler": True}
gen_server_config = {"disable_overlap_scheduler": not overlap_scheduler}
ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
gen_server_config["cache_transceiver_config"] = {"backend": "default"}
ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
if mtp_nextn > 0:
ctx_server_config["speculative_config"] = {
"decoding_type": "MTP",
@@ -597,14 +597,14 @@ def test_auto_dtype(self, overlap_scheduler):
"disable_overlap_scheduler": True,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": overlap_scheduler,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
ctx_server_config["kv_cache_config"] = {
@@ -648,13 +648,13 @@ def test_nixl_backend(self):
ctx_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
gen_server_config = {
"disable_overlap_scheduler": True,
"cache_transceiver_config": {
"backend": "nixl"
"backend": "NIXL"
}
}
ctx_server_config["cache_transceiver_config"]
@@ -686,14 +686,14 @@ def test_auto_dtype(self, overlap_scheduler):
"disable_overlap_scheduler": True,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
gen_server_config = {
"disable_overlap_scheduler": overlap_scheduler,
"cuda_graph_config": None,
"cache_transceiver_config": {
"backend": "default"
"backend": "DEFAULT"
}
}
disaggregated_server_config = {
@@ -21,7 +21,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
- "localhost:8002"
@@ -35,7 +35,7 @@ generation_servers:
tensor_parallel_size: 1
pipeline_parallel_size: 1
cache_transceiver_config:
backend: default
backend: DEFAULT
kv_cache_config:
enable_block_reuse: True
enable_partial_reuse: False
@@ -17,7 +17,7 @@ context_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8001"
- "localhost:8002"
@@ -33,7 +33,7 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.1
cache_transceiver_config:
backend: "default"
backend: "DEFAULT"
urls:
- "localhost:8003"
- "localhost:8004"
@@ -15,7 +15,7 @@ context_servers:
enable_partial_reuse: True
event_buffer_max_size: 1024
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@@ -30,6 +30,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.05
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"
@@ -15,7 +15,7 @@ context_servers:
enable_partial_reuse: True
event_buffer_max_size: 1024
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8001"
generation_servers:
@@ -30,6 +30,6 @@ generation_servers:
event_buffer_max_size: 1024
free_gpu_memory_fraction: 0.05
cache_transceiver_config:
backend: default
backend: DEFAULT
urls:
- "localhost:8002"