diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md
index 8e2a095d67..9a6fe27b50 100644
--- a/components/backends/sglang/README.md
+++ b/components/backends/sglang/README.md
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ### Large Scale P/D and WideEP Features
 
-| Feature | SGLang | Notes |
-|--------------------|--------|-----------------------------------------------------------------------|
-| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
-| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
-| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
+| Feature | SGLang | Notes |
+|---------------------|--------|--------------------------------------------------------------|
+| **WideEP** | ✅ | Full support on H100s/GB200 |
+| **DP Rank Routing** | 🚧 | Direct routing is supported; the Dynamo KV router does not yet route to individual DP workers |
+| **GB200 Support** | ✅ | |
 
 ## Quick Start
@@ -155,7 +155,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 
-### Run on multi-node
+### Run a multi-node-sized model
 - **[Run a multi-node model](docs/multinode-examples.md)**
 
 ### Large scale P/D disaggregation with WideEP
diff --git a/components/backends/sglang/docs/dsr1-wideep-gb200.md b/components/backends/sglang/docs/dsr1-wideep-gb200.md
new file mode 100644
index 0000000000..c01eaab079
--- /dev/null
+++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md
@@ -0,0 +1,171 @@
+
+
+# Running DeepSeek-R1 Disaggregated with WideEP on GB200s
+
+Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end-to-end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we run 1 prefill worker on 2 GB200 nodes (4 GPUs each) and 1 decode worker on 12 GB200 nodes (48 GPUs), for a total of 56 GPUs.
+
+## Instructions
+
+1. Build the Dynamo container
+
+```bash
+cd $DYNAMO_ROOT
+docker build \
+  -f container/Dockerfile.sglang-wideep \
+  -t dynamo-wideep-gb200 \
+  --build-arg MODE=blackwell \
+  --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
+  --build-arg ARCH=arm64 \
+  --build-arg ARCH_ALT=aarch64 \
+  .
+```
+
+2. You can run this container on each 4xGB200 node using the following command.
+
+> [!IMPORTANT]
+> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
+
+```bash
+docker run \
+    --gpus all \
+    -it \
+    --rm \
+    --network host \
+    --volume /PATH_TO_DSR1_MODEL/:/model/ \
+    --shm-size=10G \
+    --ulimit memlock=-1 \
+    --ulimit stack=67108864 \
+    --ulimit nofile=65536:65536 \
+    --cap-add CAP_SYS_PTRACE \
+    --ipc host \
+    dynamo-wideep-gb200:latest
+```
+
+3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script also tells you which environment variables to export on each node to make deployment easier; an illustrative sketch of the exports is shown below.
+
+```bash
+./utils/gen_env_vars.sh
+```
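+The script's output is authoritative for your cluster. As a rough sketch (the IPs here are hypothetical), the exports it asks for look like:
+
+```bash
+# Illustrative values only - use exactly what gen_env_vars.sh prints
+export HEAD_PREFILL_NODE_IP=10.0.0.1
+export HEAD_DECODE_NODE_IP=10.0.0.3
+export NATS_SERVER="nats://${HEAD_PREFILL_NODE_IP}:4222"
+export ETCD_ENDPOINTS="http://${HEAD_PREFILL_NODE_IP}:2379"
+```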
+4. Run the ingress and prefill worker
+
+```bash
+# run ingress
+python3 -m dynamo.frontend --http-port=8000 &
+# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
+python3 utils/sgl_http_server.py --ns dynamo &
+# run prefill worker
+SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+MC_TE_METRIC=true \
+SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+MC_FORCE_MNNVL=1 \
+NCCL_MNNVL_ENABLE=1 \
+NCCL_CUMEM_ENABLE=1 \
+SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+PYTHONUNBUFFERED=1 \
+python3 components/worker.py \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --model-path /model/ \
+  --skip-tokenizer-init \
+  --trust-remote-code \
+  --disaggregation-mode prefill \
+  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
+  --disaggregation-bootstrap-port 30001 \
+  --disaggregation-transfer-backend nixl \
+  --nnodes 2 \
+  --node-rank 0 \
+  --tp-size 8 \
+  --dp-size 8 \
+  --enable-dp-attention \
+  --host 0.0.0.0 \
+  --decode-log-interval 1 \
+  --max-running-requests 6144 \
+  --context-length 2716 \
+  --disable-radix-cache \
+  --enable-deepep-moe \
+  --deepep-mode low_latency \
+  --moe-dense-tp-size 1 \
+  --enable-dp-lm-head \
+  --disable-shared-experts-fusion \
+  --ep-num-redundant-experts 32 \
+  --ep-dispatch-algorithm static \
+  --eplb-algorithm deepseek \
+  --attention-backend cutlass_mla \
+  --watchdog-timeout 1000000 \
+  --disable-cuda-graph \
+  --chunked-prefill-size 16384 \
+  --max-total-tokens 32768 \
+  --mem-fraction-static 0.8 \
+  --log-level debug
+```
+
+On the other prefill node (this example has 2 prefill nodes in total), run the same command but change `--node-rank` to 1.
+
+5. Run the decode worker on the head decode node
+
+```bash
+SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
+MC_TE_METRIC=true \
+SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
+SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+NCCL_MNNVL_ENABLE=1 \
+MC_FORCE_MNNVL=1 \
+NCCL_CUMEM_ENABLE=1 \
+SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+PYTHONUNBUFFERED=1 \
+python3 components/decode_worker.py \
+  --served-model-name deepseek-ai/DeepSeek-R1 \
+  --model-path /model/ \
+  --skip-tokenizer-init \
+  --trust-remote-code \
+  --disaggregation-mode decode \
+  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
+  --disaggregation-bootstrap-port 30001 \
+  --nnodes 12 \
+  --node-rank 0 \
+  --tp-size 48 \
+  --dp-size 48 \
+  --enable-dp-attention \
+  --host 0.0.0.0 \
+  --decode-log-interval 1 \
+  --max-running-requests 36864 \
+  --context-length 2716 \
+  --disable-radix-cache \
+  --enable-deepep-moe \
+  --deepep-mode low_latency \
+  --moe-dense-tp-size 1 \
+  --enable-dp-lm-head \
+  --cuda-graph-bs 768 \
+  --disable-shared-experts-fusion \
+  --ep-num-redundant-experts 32 \
+  --ep-dispatch-algorithm static \
+  --eplb-algorithm deepseek \
+  --attention-backend cutlass_mla \
+  --watchdog-timeout 1000000 \
+  --chunked-prefill-size 36864 \
+  --mem-fraction-static 0.82 \
+  --log-level debug
+```
+
+On the other decode nodes (this example has 12 decode nodes in total), run the same command but change `--node-rank` to 1 through 11.
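+Once all workers have registered, you can sanity-check the deployment from the head prefill node. A minimal sketch (the frontend exposes an OpenAI-compatible API on the port passed via `--http-port`, and the model name matches `--served-model-name`):
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 32
+  }'
+```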
diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/components/backends/sglang/docs/dsr1-wideep-h100.md
index a23a3ada13..fcd43d6e66 100644
--- a/components/backends/sglang/docs/dsr1-wideep-h100.md
+++ b/components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
 ## Instructions
 
-1. Pull the SGLang release `v0.4.8.post1` container. We are actively working on validating newer releases.
-
-```bash
-docker pull lmsysorg/sglang:v0.4.8.post1-cu126
-```
-
-You can also pull a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags)
-
-2. Build the Dynamo container
+1. Build the Dynamo container
 
 ```bash
 cd $DYNAMO_ROOT
 docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
 ```
 
-3. You can run this container on each 8xH100 node using the following command.
+You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
+
+2. You can run this container on each 8xH100 node using the following command.
 
 > [!IMPORTANT]
 > We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
@@ -47,17 +41,17 @@ docker run \
 
 In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
 
-4. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
+3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script also tells you which environment variables to export on each node to make deployment easier.
 
 ```bash
 ./utils/gen_env_vars.sh
 ```
 
-5. Run the ingress and prefill worker
+4. Run the ingress and prefill worker
 
 ```bash
 # run ingress
-dynamo run in=http out=dyn &
+python3 -m dynamo.frontend --http-port=8000 &
 # optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
 python3 utils/sgl_http_server.py --ns dynamo &
 # run prefill worker
@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \
 
 On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
 
-7. Run the decode worker on the head decode node
+5. Run the decode worker on the head decode node
 
 ```bash
 python3 -m dynamo.sglang.decode_worker \
@@ -121,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \
     --deepep-mode low_latency \
     --mem-fraction-static 0.835 \
     --ep-num-redundant-experts 32 \
-    --cuda-graph-bs 256
+    --cuda-graph-bs 128
 ```
 
 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
@@ -131,6 +125,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same
 
 In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
 
 prefill:
+
 ```bash
 ...
 --max-running-requests 8192 \
@@ -142,6 +137,7 @@ prefill:
 ```
 
 decode:
+
 ```bash
 ...
 --max-running-requests 18432 \
@@ -152,9 +148,10 @@ decode:
 We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future.
 
 1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
-We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used.
+   We've found that 8k ISL 256 OSL provides a good baseline for measuring end-to-end disaggregated serving performance for DSR1. Since WideEP allows for higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up, so we also provide a short ramping warmup script.
 
 Example usage:
+
 ```bash
 # warmup
 ./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
 # run benchmark
 ./utils/bench.sh HEAD_PREFILL_NODE_IP --type e2e
 # flush cache
 curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
 ```
 
 2. **GenAI Perf to benchmark completions with custom dataset**
-We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAIPerf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
+   We provide a script that generates a JSONL file of the ShareGPT dataset and then uses GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup, but note that you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at 4096 ISL and 5 OSL.
 
 Example usage:
+
 ```bash
 # generate data
 python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
diff --git a/components/backends/sglang/slurm_jobs/README.md b/components/backends/sglang/slurm_jobs/README.md
index 19f7c27ada..fc6b437ff6 100644
--- a/components/backends/sglang/slurm_jobs/README.md
+++ b/components/backends/sglang/slurm_jobs/README.md
@@ -45,6 +45,7 @@ logs/
 ## Setup
 
 For simplicity of the example, we will make some assumptions about your SLURM cluster:
+
 1. We assume you have access to a SLURM cluster with multiple GPU nodes
    available. For functional testing, most setups should be fine. For performance
    testing, you should aim to allocate groups of nodes that are performantly
@@ -61,7 +62,11 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
 
 ## Usage
 
+> [!NOTE]
+> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
+
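+If you need to adapt that logic, the snippet below mirrors the per-node IP resolution the template performs and can be tested by hand (`node001` is a hypothetical hostname):
+
+```bash
+node=node001  # hypothetical; substitute one of your allocated node names
+addr=$(getent ahosts "$node" | grep STREAM | head -1 | awk '{print $1}')
+ip route get "$addr" | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}'
+```
+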
 1. **Submit a benchmark job**:
+
    ```bash
    python submit_job_script.py \
      --template job_script_template.j2 \
      --model-dir /path/to/model \
      --config-dir /path/to/configs \
      --container-image container-image-uri \
      --account your-slurm-account
    ```
 
@@ -72,6 +77,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
    **Required arguments**:
+
    - `--template`: Path to Jinja2 template file
    - `--model-dir`: Model directory path
    - `--config-dir`: Config directory path
    - `--container-image`: Container image URI
    - `--account`: SLURM account
 
@@ -79,26 +85,65 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
    **Optional arguments**:
+
    - `--prefill-nodes`: Number of prefill nodes (default: `2`)
    - `--decode-nodes`: Number of decode nodes (default: `2`)
    - `--gpus-per-node`: Number of GPUs per node (default: `8`)
    - `--network-interface`: Network interface to use (default: `eth3`)
    - `--job-name`: SLURM job name (default: `dynamo_setup`)
-   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
+   - `--time-limit`: Time limit in HH:MM:SS format (default: `04:00:00`)
+   - `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
+   - `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)
+   - `--partition`: SLURM partition to use (default: `batch`)
 
 **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.
 
-2. **Monitor job progress**:
+2. **Example with different GPU types**:
+
+   ```bash
+   # For H100 with Dynamo (default)
+   python submit_job_script.py \
+     --template job_script_template.j2 \
+     --model-dir /path/to/model \
+     --config-dir /path/to/configs \
+     --container-image container-image-uri \
+     --account your-slurm-account \
+     --gpu-type h100
+
+   # For GB200 with SGLang
+   python submit_job_script.py \
+     --template job_script_template.j2 \
+     --model-dir /path/to/model \
+     --config-dir /path/to/configs \
+     --container-image container-image-uri \
+     --account your-slurm-account \
+     --gpu-type gb200 \
+     --use-sglang-commands \
+     --gpus-per-node 4
+   ```
+
+3. **Monitor job progress**:
+
    ```bash
    squeue -u $USER
    ```
 
-3. **Check logs in real-time**:
+4. **Check logs in real-time**:
+
    ```bash
    tail -f logs/{JOB_ID}/log.out
    ```
 
-4. **Monitor GPU utilization**:
+   You can view the logs of all prefill or decode workers simultaneously by running:
+
+   ```bash
+   # prefill workers err (or .out)
+   tail -f logs/{JOB_ID}/*_prefill.err
+
+   # decode workers err (or .out)
+   tail -f logs/{JOB_ID}/*_decode.err
+   ```
+
+5. 
**Monitor GPU utilization**: ```bash tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log ``` diff --git a/components/backends/sglang/slurm_jobs/job_script_template.j2 b/components/backends/sglang/slurm_jobs/job_script_template.j2 index 84e0e33396..0f9b500bd4 100755 --- a/components/backends/sglang/slurm_jobs/job_script_template.j2 +++ b/components/backends/sglang/slurm_jobs/job_script_template.j2 @@ -7,6 +7,7 @@ #SBATCH --time={{ time_limit }} #SBATCH --output=logs/%j/log.out #SBATCH --error=logs/%j/log.err +#SBATCH --partition={{ partition }} # Constants PREFILL_NODES={{ prefill_nodes }} @@ -20,6 +21,8 @@ MODEL_DIR="{{ model_dir }}" CONFIG_DIR="{{ config_dir }}" CONTAINER_IMAGE="{{ container_image }}" NETWORK_INTERFACE="{{ network_interface }}" +GPU_TYPE="{{ gpu_type | default('h100') }}" +USE_SGLANG_COMMANDS="{{ use_sglang_commands | default(false) }}" {% raw %} @@ -36,14 +39,14 @@ for i in "${!nodes[@]}"; do echo "Node $i: ${nodes[$i]}" done -PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+') +PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip route get $(getent ahosts ${nodes[0]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}') if [ -z "$PREFILL_HOST_IP" ]; then echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE" exit 1 fi echo "Prefill host IP address: $PREFILL_HOST_IP" -DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+') +DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ip route get $(getent ahosts ${nodes[$PREFILL_NODES]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}') if [ -z "$DECODE_HOST_IP" ]; then echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE" exit 1 @@ -54,21 +57,25 @@ echo "Decode host IP address: $DECODE_HOST_IP" ENROOT_ARGS="\ --container-image=${CONTAINER_IMAGE} \ --no-container-entrypoint \ - --container-mount-home \ - --no-container-remap-root \ + --no-container-mount-home \ --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \ " +# Build common worker arguments +WORKER_ARGS="--gpu_type ${GPU_TYPE} --gpus_per_node ${GPUS_PER_NODE}" +if [ "$USE_SGLANG_COMMANDS" = "True" ]; then + WORKER_ARGS="${WORKER_ARGS} --use-sglang-commands" +fi + # Launch prefill tasks on the first PREFILL_NODES nodes for i in $(seq 0 $((PREFILL_NODES - 1))); do node=${nodes[$i]} rank=$i echo "Launching prefill task on node ${i} (rank ${rank}): $node" - echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err" - echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &" - srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \ - --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \ - python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill 
--gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log & + + cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log ${WORKER_ARGS}" + echo "$cmd" + $cmd & done # Launch decode tasks on the next DECODE_NODES nodes @@ -76,11 +83,10 @@ for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do node=${nodes[$i]} rank=$((i - PREFILL_NODES)) echo "Launching decode task on node ${i} (rank ${rank}): $node" - echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err" - echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &" - srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \ - --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \ - python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log & + + cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log ${WORKER_ARGS}" + echo "$cmd" + $cmd & done echo "" diff --git a/components/backends/sglang/slurm_jobs/scripts/gb200.sh b/components/backends/sglang/slurm_jobs/scripts/gb200.sh new file mode 100755 index 0000000000..cbc8cbce88 --- /dev/null +++ b/components/backends/sglang/slurm_jobs/scripts/gb200.sh @@ -0,0 +1,272 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0
+
+# Function to print usage
+print_usage() {
+    echo "Usage: $0 <mode> <cmd>"
+    echo "  mode: prefill or decode"
+    echo "  cmd: dynamo or sglang"
+    echo ""
+    echo "Examples:"
+    echo "  $0 prefill dynamo"
+    echo "  $0 decode sglang"
+    exit 1
+}
+
+# Check if correct number of arguments provided
+if [ $# -ne 2 ]; then
+    echo "Error: Expected 2 arguments, got $#"
+    print_usage
+fi
+
+# Parse arguments
+mode=$1
+cmd=$2
+
+# Validate mode argument
+if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
+    echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
+    print_usage
+fi
+
+# Validate cmd argument
+if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
+    echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
+    print_usage
+fi
+
+echo "Mode: $mode"
+echo "Command: $cmd"
+
+
+# Check if required environment variables are set
+if [ -z "$HOST_IP" ]; then
+    echo "Error: HOST_IP environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$PORT" ]; then
+    echo "Error: PORT environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_GPUS" ]; then
+    echo "Error: TOTAL_GPUS environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$RANK" ]; then
+    echo "Error: RANK environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_NODES" ]; then
+    echo "Error: TOTAL_NODES environment variable is not set"
+    exit 1
+fi
+
+# TODO: since the args for sglang and dynamo are the same, we can be a bit cleaner here
+
+# Construct command based on mode and cmd
+if [ "$mode" = "prefill" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # We are not using an init-expert-location file for e2e benchmarking
+        # We also don't currently have a --deepep-config file for GB200
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+        SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+        MC_TE_METRIC=true \
+        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+        SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+        SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        MC_FORCE_MNNVL=1 \
+        NCCL_MNNVL_ENABLE=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 components/worker.py \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --model-path /model/ \
+            --skip-tokenizer-init \
+            --trust-remote-code \
+            --disaggregation-mode prefill \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --disaggregation-bootstrap-port 30001 \
+            --disaggregation-transfer-backend nixl \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --host 0.0.0.0 \
+            --decode-log-interval 1 \
+            --max-running-requests 6144 \
+            --context-length 2716 \
+            --disable-radix-cache \
+            --enable-deepep-moe \
+            --deepep-mode low_latency \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --disable-shared-experts-fusion \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm static \
+            --eplb-algorithm deepseek \
+            --attention-backend cutlass_mla \
+            --watchdog-timeout 1000000 \
+            --disable-cuda-graph \
+            --chunked-prefill-size 16384 \
+            --max-total-tokens 32768 \
+            --mem-fraction-static 0.8 \
+            --log-level debug
+
+    elif [ "$cmd" = "sglang" ]; then
+        # GB200 sglang prefill command
+        # We are not using an init-expert-location file for e2e benchmarking
+        # We also don't currently have a --deepep-config file for GB200
+        # Need to increase --context-length to 10k for 8k1k benchmarking
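+        # (Assumption) The SGLANG_DISAGGREGATION_* heartbeat/bootstrap/waiting
+        # values below are deliberately very large so that slow bootstrap and
+        # long warmups during benchmarking are not treated as worker failures.
+        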
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \ + MC_TE_METRIC=true \ + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \ + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \ + SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \ + SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \ + NCCL_MNNVL_ENABLE=1 \ + MC_FORCE_MNNVL=1 \ + NCCL_CUMEM_ENABLE=1 \ + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \ + SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \ + PYTHONUNBUFFERED=1 \ + python3 -m sglang.launch_server \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --model-path /model/ \ + --trust-remote-code \ + --disaggregation-mode prefill \ + --dist-init-addr "$HOST_IP:$PORT" \ + --disaggregation-bootstrap-port 30001 \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --decode-log-interval 1 \ + --max-running-requests 6144 \ + --context-length 2716 \ + --disable-radix-cache \ + --enable-deepep-moe \ + --deepep-mode low_latency \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-shared-experts-fusion \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm static \ + --eplb-algorithm deepseek \ + --attention-backend cutlass_mla \ + --watchdog-timeout 1000000 \ + --disable-cuda-graph \ + --chunked-prefill-size 16384 \ + --max-total-tokens 32768 \ + --mem-fraction-static 0.8 \ + --log-level debug + fi +elif [ "$mode" = "decode" ]; then + if [ "$cmd" = "dynamo" ]; then + # Need to increase --context-length to 10k for 8k1k benchmarking + # We are not using a init-expert-location file for e2e benchmarking + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \ + MC_TE_METRIC=true \ + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \ + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \ + SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \ + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \ + SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \ + NCCL_MNNVL_ENABLE=1 \ + MC_FORCE_MNNVL=1 \ + NCCL_CUMEM_ENABLE=1 \ + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \ + SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \ + PYTHONUNBUFFERED=1 \ + python3 components/decode_worker.py \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --model-path /model/ \ + --skip-tokenizer-init \ + --trust-remote-code \ + --disaggregation-mode decode \ + --dist-init-addr "$HOST_IP:$PORT" \ + --disaggregation-bootstrap-port 30001 \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --decode-log-interval 1 \ + --max-running-requests 36864 \ + --context-length 2716 \ + --disable-radix-cache \ + --enable-deepep-moe \ + --deepep-mode low_latency \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --cuda-graph-bs 768 \ + --disable-shared-experts-fusion \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm static \ + --eplb-algorithm deepseek \ + --attention-backend cutlass_mla \ + --watchdog-timeout 1000000 \ + --chunked-prefill-size 36864 \ + --mem-fraction-static 0.82 \ + --log-level debug + + elif [ "$cmd" = "sglang" ]; then + # GB200 sglang decode command + # Need to increase --context-length to 10k for 8k1k benchmarking + # We are not using a init-expert-location file for e2e benchmarking + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \ + MC_TE_METRIC=true \ + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \ + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \ + SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \ + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \ + 
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        NCCL_MNNVL_ENABLE=1 \
+        MC_FORCE_MNNVL=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 -m sglang.launch_server \
+            --model-path /model/ \
+            --trust-remote-code \
+            --disaggregation-mode decode \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --disaggregation-bootstrap-port 30001 \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --host 0.0.0.0 \
+            --decode-log-interval 1 \
+            --max-running-requests 36864 \
+            --context-length 2716 \
+            --disable-radix-cache \
+            --enable-deepep-moe \
+            --deepep-mode low_latency \
+            --moe-dense-tp-size 1 \
+            --enable-dp-lm-head \
+            --cuda-graph-bs 768 \
+            --disable-shared-experts-fusion \
+            --ep-num-redundant-experts 32 \
+            --ep-dispatch-algorithm static \
+            --eplb-algorithm deepseek \
+            --attention-backend cutlass_mla \
+            --watchdog-timeout 1000000 \
+            --chunked-prefill-size 36864 \
+            --mem-fraction-static 0.82 \
+            --log-level debug
+    fi
+fi
diff --git a/components/backends/sglang/slurm_jobs/scripts/h100.sh b/components/backends/sglang/slurm_jobs/scripts/h100.sh
new file mode 100755
index 0000000000..b457484e3a
--- /dev/null
+++ b/components/backends/sglang/slurm_jobs/scripts/h100.sh
@@ -0,0 +1,189 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Function to print usage
+print_usage() {
+    echo "Usage: $0 <mode> <cmd>"
+    echo "  mode: prefill or decode"
+    echo "  cmd: dynamo or sglang"
+    echo ""
+    echo "Examples:"
+    echo "  $0 prefill dynamo"
+    echo "  $0 decode sglang"
+    exit 1
+}
+
+# Check if correct number of arguments provided
+if [ $# -ne 2 ]; then
+    echo "Error: Expected 2 arguments, got $#"
+    print_usage
+fi
+
+# Parse arguments
+mode=$1
+cmd=$2
+
+# Validate mode argument
+if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
+    echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
+    print_usage
+fi
+
+# Validate cmd argument
+if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
+    echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
+    print_usage
+fi
+
+echo "Mode: $mode"
+echo "Command: $cmd"
+
+
+# Check if required environment variables are set
+if [ -z "$HOST_IP" ]; then
+    echo "Error: HOST_IP environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$PORT" ]; then
+    echo "Error: PORT environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_GPUS" ]; then
+    echo "Error: TOTAL_GPUS environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$RANK" ]; then
+    echo "Error: RANK environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_NODES" ]; then
+    echo "Error: TOTAL_NODES environment variable is not set"
+    exit 1
+fi
+
+# Construct command based on mode and cmd
+if [ "$mode" = "prefill" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # H100 dynamo prefill command
+        python3 components/worker.py \
+            --model-path /model/ \
+            --served-model-name deepseek-ai/DeepSeek-R1 \
+            --skip-tokenizer-init \
+            --disaggregation-mode prefill \
+            --disaggregation-transfer-backend nixl \
+            --disaggregation-bootstrap-port 30001 \
+            --dist-init-addr "$HOST_IP:$PORT" \
+            --nnodes "$TOTAL_NODES" \
+            --node-rank "$RANK" \
+            --tp-size "$TOTAL_GPUS" \
+            --dp-size "$TOTAL_GPUS" \
+            --enable-dp-attention \
+            --decode-log-interval 1 \
+            --enable-deepep-moe \
+            --page-size 1 \
+            --trust-remote-code \
+            --moe-dense-tp-size 1 \
+            
--enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode normal \ + --mem-fraction-static 0.85 \ + --deepep-config /configs/deepep.json \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm dynamic \ + --eplb-algorithm deepseek + elif [ "$cmd" = "sglang" ]; then + # H100 sglang prefill command + python3 -m sglang.launch_server \ + --model-path /model/ \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode prefill \ + --dist-init-addr "$HOST_IP:$PORT" \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --decode-log-interval 1 \ + --enable-deepep-moe \ + --page-size 1 \ + --host 0.0.0.0 \ + --trust-remote-code \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode normal \ + --mem-fraction-static 0.85 \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm dynamic \ + --eplb-algorithm deepseek \ + --deepep-config /configs/deepep.json + fi +elif [ "$mode" = "decode" ]; then + if [ "$cmd" = "dynamo" ]; then + # H100 dynamo decode command + python3 components/decode_worker.py \ + --model-path /model/ \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --skip-tokenizer-init \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend nixl \ + --disaggregation-bootstrap-port 30001 \ + --dist-init-addr "$HOST_IP:$PORT" \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --decode-log-interval 1 \ + --enable-deepep-moe \ + --page-size 1 \ + --trust-remote-code \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode low_latency \ + --mem-fraction-static 0.835 \ + --ep-num-redundant-experts 32 \ + --cuda-graph-bs 256 + elif [ "$cmd" = "sglang" ]; then + # H100 sglang decode command + python3 -m sglang.launch_server \ + --model-path /model/ \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode decode \ + --dist-init-addr "$HOST_IP:$PORT" \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --decode-log-interval 1 \ + --enable-deepep-moe \ + --page-size 1 \ + --host 0.0.0.0 \ + --trust-remote-code \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode low_latency \ + --mem-fraction-static 0.835 \ + --ep-num-redundant-experts 32 \ + --cuda-graph-bs 256 + fi +fi + + diff --git a/components/backends/sglang/slurm_jobs/scripts/worker_setup.py b/components/backends/sglang/slurm_jobs/scripts/worker_setup.py index db6ac88531..cfe2aaa634 100644 --- a/components/backends/sglang/slurm_jobs/scripts/worker_setup.py +++ b/components/backends/sglang/slurm_jobs/scripts/worker_setup.py @@ -8,8 +8,8 @@ The script will: - Setup the environment -- Update the YAML config file -- Start Dynamo graphs.disagg service +- Generate the python3 command to run the prefill or decode worker +- Start dynamo (or sglang) - Monitor the GPU utilization """ @@ -165,6 +165,19 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac default=None, help="File to log GPU utilization (default: None)", ) + 
parser.add_argument( + "--use-sglang-commands", + action="store_true", + default=False, + help="Helper to spin up SGLang servers instead of dynamo. This is helpful for benchmarking SGLang as well", + ) + parser.add_argument( + "--gpu_type", + type=str, + choices=["h100", "gb200"], + default="h100", + help="Type of GPU to use", + ) return parser.parse_args(args) @@ -181,73 +194,114 @@ def _validate_args(args: argparse.Namespace) -> None: raise ValueError("GPUs per node must be at least 1") -def setup_prefill_node( - rank: int, prefill_host_ip: str, total_nodes: int, total_gpus: int -) -> int: +def get_sglang_mini_lb_command_args(prefill_host_ip: str, decode_host_ip: str) -> str: + cmd = ( + f"python3 -m sglang.srt.disaggregation.launch_lb " + f"--prefill http://{prefill_host_ip}:30000 " + f"--decode http://{decode_host_ip}:30000 " + "--host 0.0.0.0 " + "--port 8000 " + "--timeout 3600" + ) + return cmd + + +def setup_env_vars_for_gpu_script( + host_ip: str, + rank: int, + total_gpus: int, + total_nodes: int, + port: int = DIST_INIT_PORT, +): + """Setup environment variables required by GPU scripts (h100.sh, gb200.sh)""" + os.environ["HOST_IP"] = host_ip + os.environ["PORT"] = str(port) + os.environ["TOTAL_GPUS"] = str(total_gpus) + os.environ["RANK"] = str(rank) + os.environ["TOTAL_NODES"] = str(total_nodes) + + logging.info(f"Set HOST_IP: {host_ip}") + logging.info(f"Set PORT: {port}") + logging.info(f"Set TOTAL_GPUS: {total_gpus}") + logging.info(f"Set RANK: {rank}") + logging.info(f"Set TOTAL_NODES: {total_nodes}") + + +def get_gpu_command(worker_type: str, use_sglang_commands: bool, gpu_type: str) -> str: + """Generate command to run the appropriate GPU script""" + script_name = f"{gpu_type}.sh" + script_path = Path(__file__).parent / script_name + mode = worker_type # "prefill" or "decode" + cmd = "sglang" if use_sglang_commands else "dynamo" + + return f"bash {script_path} {mode} {cmd}" + + +def setup_head_prefill_node(prefill_host_ip: str) -> None: """ - Setup the prefill node. + Setup NATS, etcd, ingress, and http servers on the prefill host node. 
""" - if rank == 0: - logging.info(f"Setting up host prefill node: {rank}") - logging.info(f"Starting nats server on node {rank} with IP {prefill_host_ip}") - - nats_process = run_command("nats-server -js", background=True) - if not nats_process: - raise RuntimeError("Failed to start nats-server") - - etcd_cmd = ( - f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " - f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " - f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} " - f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}" - ) + logging.info(f"Starting nats server on node {prefill_host_ip}") + + nats_process = run_command("nats-server -js", background=True) + if not nats_process: + raise RuntimeError("Failed to start nats-server") + + logging.info(f"Starting etcd server on node {prefill_host_ip}") + etcd_cmd = ( + f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " + f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " + f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} " + f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}" + ) - etcd_process = run_command(etcd_cmd, background=True) - if not etcd_process: - raise RuntimeError("Failed to start etcd") + etcd_process = run_command(etcd_cmd, background=True) + if not etcd_process: + raise RuntimeError("Failed to start etcd") - ingress_process = run_command("dynamo run in=http out=dyn", background=True) - if not ingress_process: - raise RuntimeError("Failed to start ingress") + logging.info(f"Starting ingress server on node {prefill_host_ip}") + ingress_process = run_command( + "dynamo run in=http out=dyn --http-port=8000", background=True + ) + if not ingress_process: + raise RuntimeError("Failed to start ingress") + + logging.info( + f"Starting http server on port 9001 for flush_cache endpoint on node {prefill_host_ip}" + ) + cache_flush_server_cmd = "python3 utils/sgl_http_server.py --ns dynamo" + cache_flush_server_process = run_command(cache_flush_server_cmd, background=True) + if not cache_flush_server_process: + raise RuntimeError("Failed to start cache flush server") + +def setup_prefill_node( + rank: int, + prefill_host_ip: str, + total_nodes: int, + total_gpus: int, + use_sglang_commands: bool, + gpu_type: str, +) -> int: + """ + Setup the prefill node. + """ + if not use_sglang_commands: + if rank == 0: + setup_head_prefill_node(prefill_host_ip) + else: + logging.info(f"Setting up child prefill node: {rank}") + if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): + raise RuntimeError("Failed to connect to etcd") else: - logging.info(f"Setting up child prefill node: {rank}") - if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): - raise RuntimeError("Failed to connect to etcd") + logging.info("Using SGLang servers. No need to setup etcd or nats") - # NOTE: This implements the example in examples/sglang/dsr1-wideep.md - # For other examples, the command might have to be modified. 
- dynamo_cmd = ( - f"python3 -m dynamo.sglang.worker " - "--model-path /model/ " - "--served-model-name deepseek-ai/DeepSeek-R1 " - "--skip-tokenizer-init " - "--disaggregation-mode prefill " - "--disaggregation-transfer-backend nixl " - "--disaggregation-bootstrap-port 30001 " - f"--dist-init-addr {prefill_host_ip}:{DIST_INIT_PORT} " - f"--nnodes {total_nodes} " - f"--node-rank {rank} " - f"--tp-size {total_gpus} " - f"--dp-size {total_gpus} " - "--enable-dp-attention " - "--decode-log-interval 1 " - "--enable-deepep-moe " - "--page-size 1 " - "--trust-remote-code " - "--moe-dense-tp-size 1 " - "--enable-dp-lm-head " - "--disable-radix-cache " - "--watchdog-timeout 1000000 " - "--enable-two-batch-overlap " - "--deepep-mode normal " - "--mem-fraction-static 0.85 " - "--deepep-config /configs/deepep.json " - "--ep-num-redundant-experts 32 " - "--ep-dispatch-algorithm dynamic " - "--eplb-algorithm deepseek " - ) - return run_command(dynamo_cmd) + # Setup environment variables for GPU script + setup_env_vars_for_gpu_script(prefill_host_ip, rank, total_gpus, total_nodes) + + # Use appropriate GPU script instead of generating command directly + cmd_to_run = get_gpu_command("prefill", use_sglang_commands, gpu_type) + return run_command(cmd_to_run) def setup_decode_node( @@ -256,45 +310,29 @@ def setup_decode_node( prefill_host_ip: str, total_nodes: int, total_gpus: int, + use_sglang_commands: bool, + gpu_type: str, ) -> int: """ Setup the decode node. """ logging.info(f"Setting up child decode node: {rank}") - if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): - raise RuntimeError("Failed to connect to etcd") - - dynamo_cmd = ( - "python3 -m dynamo.sglang.decode_worker " - "--model-path /model/ " - "--served-model-name deepseek-ai/DeepSeek-R1 " - "--skip-tokenizer-init " - "--disaggregation-mode decode " - "--disaggregation-transfer-backend nixl " - "--disaggregation-bootstrap-port 30001 " - f"--dist-init-addr {decode_host_ip}:{DIST_INIT_PORT} " - f"--nnodes {total_nodes} " - f"--node-rank {rank} " - f"--tp-size {total_gpus} " - f"--dp-size {total_gpus} " - "--enable-dp-attention " - "--decode-log-interval 1 " - "--enable-deepep-moe " - "--page-size 1 " - "--trust-remote-code " - "--moe-dense-tp-size 1 " - "--enable-dp-lm-head " - "--disable-radix-cache " - "--watchdog-timeout 1000000 " - "--enable-two-batch-overlap " - "--deepep-mode low_latency " - "--mem-fraction-static 0.835 " - "--ep-num-redundant-experts 32 " - "--cuda-graph-bs 256 " - ) + if use_sglang_commands: + sgl_mini_lb_cmd = get_sglang_mini_lb_command_args( + prefill_host_ip, decode_host_ip + ) + run_command(sgl_mini_lb_cmd, background=True) + else: + if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): + raise RuntimeError("Failed to connect to etcd") + + # Setup environment variables for GPU script + setup_env_vars_for_gpu_script(decode_host_ip, rank, total_gpus, total_nodes) - return run_command(dynamo_cmd) + # Use appropriate GPU script instead of generating command directly + cmd_to_run = get_gpu_command("decode", use_sglang_commands, gpu_type) + return run_command(cmd_to_run) def setup_env(prefill_host_ip: str): @@ -321,6 +359,7 @@ def main(input_args: list[str] | None = None): logging.info(f"Prefill host IP: {args.prefill_host_ip}") logging.info(f"Decode host IP: {args.decode_host_ip}") logging.info(f"Rank: {args.rank}") + logging.info(f"Use SGLang commands: {args.use_sglang_commands}") setup_env(args.prefill_host_ip) if args.worker_type == "prefill": @@ -329,6 +368,8 @@ def 
main(input_args: list[str] | None = None): args.prefill_host_ip, args.total_nodes, args.total_nodes * args.gpus_per_node, + args.use_sglang_commands, + args.gpu_type, ) else: setup_decode_node( @@ -337,6 +378,8 @@ def main(input_args: list[str] | None = None): args.prefill_host_ip, args.total_nodes, args.total_nodes * args.gpus_per_node, + args.use_sglang_commands, + args.gpu_type, ) logging.info(f"{args.worker_type.capitalize()} node setup complete") diff --git a/components/backends/sglang/slurm_jobs/submit_job_script.py b/components/backends/sglang/slurm_jobs/submit_job_script.py index 64f492224e..ee386929c8 100644 --- a/components/backends/sglang/slurm_jobs/submit_job_script.py +++ b/components/backends/sglang/slurm_jobs/submit_job_script.py @@ -86,7 +86,7 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac parser.add_argument("--config-dir", required=True, help="Config directory path") parser.add_argument("--container-image", required=True, help="Container image") parser.add_argument( - "--time-limit", default="01:00:00", help="Time limit (HH:MM:SS)" + "--time-limit", default="04:00:00", help="Time limit (HH:MM:SS)" ) parser.add_argument( "--prefill-nodes", type=int, default=2, help="Number of prefill nodes" @@ -100,6 +100,20 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac parser.add_argument( "--network-interface", default="eth3", help="Network interface to use" ) + parser.add_argument( + "--gpu-type", choices=["h100", "gb200"], default="h100", help="GPU type to use" + ) + parser.add_argument( + "--use-sglang-commands", + action="store_true", + default=False, + help="Use SGLang commands instead of Dynamo", + ) + parser.add_argument( + "--partition", + default="batch", + help="SLURM partition to use", + ) return parser.parse_args(args) @@ -120,6 +134,9 @@ def main(input_args: list[str] | None = None): "container_image": args.container_image, "gpus_per_node": args.gpus_per_node, "network_interface": args.network_interface, + "gpu_type": args.gpu_type, + "use_sglang_commands": args.use_sglang_commands, + "partition": args.partition, } with tempfile.NamedTemporaryFile(mode="w", suffix=".sh") as temp_file: diff --git a/container/Dockerfile.sglang-wideep b/container/Dockerfile.sglang-wideep index 8cc05aa408..891cd20c5e 100644 --- a/container/Dockerfile.sglang-wideep +++ b/container/Dockerfile.sglang-wideep @@ -13,160 +13,132 @@ # See the License for the specific language governing permissions and # limitations under the License. 
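+# Example build invocations (values are illustrative; see the docs for details):
+#   H100:  docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep
+#   GB200: docker build -f container/Dockerfile.sglang-wideep -t dynamo-wideep-gb200 \
+#            --build-arg MODE=blackwell --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
+#            --build-arg ARCH=arm64 --build-arg ARCH_ALT=aarch64 .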
-# This should be pinned to the sglang version that is installed with Dynamo
-# in the pyproject.toml
-FROM lmsysorg/sglang:v0.4.8.post1-cu126
+ARG SGLANG_IMAGE_TAG="v0.4.10-cu126"
 
-# Add NIXL build dependencies
-RUN apt-get update -y && \
-    apt-get install -y \
-    cmake \
-    meson \
-    ninja-build \
-    pybind11-dev \
-    patchelf \
-    net-tools
-
-# Install Python build dependencies
-RUN pip install --break-system-packages meson-python wheel build
+FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
 
-# Add architecture args for NIXL build
-ARG ARCH=amd64
-ARG ARCH_ALT=x86_64
-
-WORKDIR /sgl-workspace
+ARG MODE="hopper"
+ARG ARCH="amd64"
+ARG ARCH_ALT="x86_64"
+ARG NIXL_UCX_REF="v1.19.x"
+ARG NIXL_TAG="0.4.1"
+ARG CMAKE_VERSION="3.31.8"
+ARG RUST_VERSION="1.87.0"
+ARG CARGO_BUILD_JOBS="16"
 
-# Install UCX dependencies
 RUN apt-get update -y && \
-    apt-get install -y --no-install-recommends \
-    --reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
-    libnuma-dev librdmacm-dev ibverbs-providers \
-    autoconf libtool
-
-# Build UCX from source
-ARG NIXL_UCX_REF=v1.19.x
-RUN rm -rf /opt/hpcx/ucx && \
-    rm -rf /usr/local/ucx && \
-    cd /usr/local/src && \
-    git clone https://github.com/openucx/ucx.git && \
-    cd ucx && \
-    git checkout $NIXL_UCX_REF && \
-    ./autogen.sh && ./configure \
-    --prefix=/usr/local/ucx \
-    --enable-shared \
-    --disable-static \
-    --disable-doxygen-doc \
-    --enable-optimizations \
-    --enable-cma \
-    --enable-devel-headers \
-    --with-cuda=/usr/local/cuda \
-    --with-verbs \
-    --with-efa \
-    --with-dm \
-    --with-gdrcopy=/usr/local \
-    --enable-mt && \
-    make -j && \
-    make -j install-strip && \
-    ldconfig
+    apt-get install -y \
+    cmake meson ninja-build pybind11-dev patchelf net-tools \
+    build-essential protobuf-compiler libssl-dev pkg-config \
+    clang libclang-dev git rapidjson-dev zlib1g-dev && \
+    pip install --break-system-packages meson-python wheel build
+
+# Build UCX + NIXL for x86/hopper until it's fully tested on GB200
+RUN if [ "$MODE" = "hopper" ]; then \
+    apt-get install -y --no-install-recommends \
+    libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
+    libnuma-dev librdmacm-dev ibverbs-providers autoconf libtool && \
+    # UCX from source
+    rm -rf /opt/hpcx/ucx /usr/local/ucx && \
+    cd /usr/local/src && \
+    git clone https://github.com/openucx/ucx.git && \
+    cd ucx && git checkout $NIXL_UCX_REF && \
+    ./autogen.sh && \
+    ./configure \
+      --prefix=/usr/local/ucx \
+      --enable-shared \
+      --disable-static \
+      --disable-doxygen-doc \
+      --enable-optimizations \
+      --enable-cma \
+      --enable-devel-headers \
+      --with-cuda=/usr/local/cuda \
+      --with-verbs \
+      --with-efa \
+      --with-dm \
+      --with-gdrcopy=/usr/local \
+      --enable-mt && \
+    make -j && make install-strip && ldconfig && \
+    # NIXL
+    git clone https://github.com/ai-dynamo/nixl.git /opt/nixl && \
+    cd /opt/nixl && git checkout $NIXL_TAG && \
+    pip install --break-system-packages . \
+    --config-settings="setup-args=-Ducx_path=/usr/local/ucx"; \
+    fi
 
 ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH
 
-ARG NIXL_TAG=0.4.1
-RUN git clone https://github.com/ai-dynamo/nixl.git && cd nixl && git checkout ${NIXL_TAG} && pip install --break-system-packages . 
--config-settings=setup-args="-Ducx_path=/usr/local/ucx" - -WORKDIR /sgl-workspace - -# Allow forceful shutdown of inflight requests -ENV SGL_FORCE_SHUTDOWN=1 - +# Dynamo WORKDIR /sgl-workspace RUN git clone https://github.com/ai-dynamo/dynamo.git -# install dynamo in editable mode -WORKDIR /sgl-workspace/dynamo -# Rust build/dev dependencies -RUN apt update -y && \ - apt install --no-install-recommends -y \ - build-essential \ - protobuf-compiler \ - cmake \ - libssl-dev \ - pkg-config \ - clang \ - libclang-dev \ - git - -# Define Rust target based on ARCH_ALT ARG -ARG RUSTARCH=${ARCH_ALT}-unknown-linux-gnu - ENV RUSTUP_HOME=/usr/local/rustup \ CARGO_HOME=/usr/local/cargo \ - PATH=/usr/local/cargo/bin:$PATH \ - RUST_VERSION=1.86.0 + PATH=/usr/local/cargo/bin:$PATH -# Install Rust using RUSTARCH derived from ARCH_ALT -RUN wget --tries=3 --waitretry=5 "https://static.rust-lang.org/rustup/archive/1.28.1/${RUSTARCH}/rustup-init" && \ - # TODO: Add SHA check back based on RUSTARCH +RUN wget --tries=3 --waitretry=5 \ + "https://static.rust-lang.org/rustup/archive/1.28.1/${ARCH_ALT}-unknown-linux-gnu/rustup-init" && \ chmod +x rustup-init && \ - ./rustup-init -y --no-modify-path --profile minimal --default-toolchain $RUST_VERSION --default-host ${RUSTARCH} && \ + ./rustup-init -y \ + --no-modify-path \ + --profile minimal \ + --default-toolchain $RUST_VERSION \ + --default-host ${ARCH_ALT}-unknown-linux-gnu && \ rm rustup-init && \ chmod -R a+w $RUSTUP_HOME $CARGO_HOME ARG CARGO_BUILD_JOBS -# Set CARGO_BUILD_JOBS to 16 if not provided -# This is to prevent cargo from building $(nproc) jobs in parallel, -# which might exceed the number of opened files limit. -ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS:-16} +ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS} + +RUN cd dynamo && cargo build --release -RUN cargo build --release +RUN cd dynamo/lib/bindings/python && \ + pip install --break-system-packages -e . && \ + cd /sgl-workspace/dynamo && \ + pip install --break-system-packages . -RUN cd lib/bindings/python && pip install --break-system-packages -e . && cd ../../.. -RUN pip install --break-system-packages . 
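+# sglang-router is pinned; 0.1.5 was the latest release at the time of writing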
+RUN pip install --break-system-packages sglang-router==0.1.5 -RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.28/nats-server-v2.10.28-${ARCH}.deb && \ +RUN wget --tries=3 --waitretry=5 \ + https://github.com/nats-io/nats-server/releases/download/v2.10.28/\ +nats-server-v2.10.28-${ARCH}.deb && \ dpkg -i nats-server-v2.10.28-${ARCH}.deb && rm nats-server-v2.10.28-${ARCH}.deb ENV ETCD_VERSION="v3.5.21" -RUN wget --tries=3 --waitretry=5 https://github.com/etcd-io/etcd/releases/download/$ETCD_VERSION/etcd-$ETCD_VERSION-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \ +RUN wget --tries=3 --waitretry=5 \ + https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/\ +etcd-${ETCD_VERSION}-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \ mkdir -p /usr/local/bin/etcd && \ - tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/etcd --strip-components=1 && \ + tar -xzf /tmp/etcd.tar.gz \ + -C /usr/local/bin/etcd --strip-components=1 && \ rm /tmp/etcd.tar.gz -ENV PATH=/usr/local/bin/etcd/:$PATH -ARG CMAKE_VERSION=3.31.8 -RUN mkdir /sgl-workspace/cmake_build -WORKDIR /sgl-workspace/cmake_build +ENV PATH=/usr/local/bin/etcd:$PATH -# uninstall CMake +# GenAI Perf RUN apt-get purge -y cmake -# download newer version of CMake -RUN wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ - tar -xvzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ - mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake -ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH -# should be 3.31.8 +RUN mkdir /sgl-workspace/cmake_build && \ + cd /sgl-workspace/cmake_build && \ + wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/\ +cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ + tar -xzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ + mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake && \ + rm cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz + +ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH RUN cmake --version -# Install perf_analyzer and genai-perf -RUN apt-get update -y && \ - apt-get install -y --no-install-recommends \ - rapidjson-dev \ - # jq and curl for polling various endpoints and health checks - jq \ - curl \ - zlib1g-dev - -RUN git clone --depth=1 https://github.com/triton-inference-server/perf_analyzer.git && \ +RUN git clone --depth=1 \ + https://github.com/triton-inference-server/perf_analyzer.git && \ mkdir perf_analyzer/build && \ cmake -B perf_analyzer/build -S perf_analyzer && \ - cmake --build perf_analyzer/build -- -j8 + cmake --build perf_analyzer/build -- -j$(nproc) ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH - RUN pip install --break-system-packages genai-perf -# https://pypi.org/project/sglang-router/0.1.5 is latest -RUN pip install sglang-router==0.1.5 +# Enable forceful shutdown of inflight requests +ENV SGL_FORCE_SHUTDOWN=1 WORKDIR /sgl-workspace/dynamo/components/backends/sglang