From f404ab0a4291815ca708595da8b1716ef71ab988 Mon Sep 17 00:00:00 2001
From: ishandhanani
Date: Thu, 31 Jul 2025 12:14:06 -0700
Subject: [PATCH 1/9] docs(sglang): add GB200 deployment guide for DeepSeek-R1 with WideEP
---
 .../backends/sglang/docs/dsr1-wideep-gb200.md | 172 +++++++++++++++
 .../backends/sglang/docs/dsr1-wideep-h100.md  |  12 +-
 container/Dockerfile.sglang-wideep            | 207 ++++++++----------
 3 files changed, 269 insertions(+), 122 deletions(-)
 create mode 100644 components/backends/sglang/docs/dsr1-wideep-gb200.md

diff --git a/components/backends/sglang/docs/dsr1-wideep-gb200.md b/components/backends/sglang/docs/dsr1-wideep-gb200.md
new file mode 100644
index 0000000000..ea987fae0f
--- /dev/null
+++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md
@@ -0,0 +1,172 @@
+
+
+# Running DeepSeek-R1 Disaggregated with WideEP on GB200s

Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end-to-end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we will run 1 prefill worker on 2 GB200 nodes (4 GPUs each, 8 GPUs total) and 1 decode worker on 12 GB200 nodes (48 GPUs), for a total of 56 GPUs across 14 nodes.

## Instructions

1. Build the Dynamo container

```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-wideep-gb200 \
  --build-arg MODE=blackwell \
  --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
  --build-arg ARCH=arm64 \
  --build-arg ARCH_ALT=aarch64 \
  . \
  --no-cache
```

2. You can run this container on each 4xGB200 node using the following command.

> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)

```bash
docker run \
    --gpus all \
    -it \
    --rm \
    --network host \
    --volume /PATH_TO_DSR1_MODEL/:/model/ \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
    dynamo-wideep-gb200:latest
```

3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script will also tell you which environment variables to export on each node to make deployment easier.

```bash
./utils/gen_env_vars.sh
```
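The exact values are cluster-specific; `gen_env_vars.sh` prints the precise export commands for your deployment. As a rough sketch of what to expect (the IPs are placeholders, and the `NATS_SERVER`/`ETCD_ENDPOINTS` variable names plus the default ports 4222 and 2379 are assumptions, not output copied from the script):

```bash
# Hypothetical example -- use the values printed by gen_env_vars.sh
export HEAD_PREFILL_NODE_IP=10.0.0.1   # head prefill node
export HEAD_DECODE_NODE_IP=10.0.0.2    # head decode node

# Point every node at the services running on the head prefill node
export NATS_SERVER="nats://${HEAD_PREFILL_NODE_IP}:4222"      # assumed default NATS port
export ETCD_ENDPOINTS="http://${HEAD_PREFILL_NODE_IP}:2379"   # assumed default etcd client port
```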
4. Run the ingress and prefill worker

```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/worker.py \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode prefill \
  --dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --disaggregation-transfer-backend nixl \
  --nnodes 2 \
  --node-rank 0 \
  --tp-size 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 6144 \
  --context-length 2716 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --disable-cuda-graph \
  --chunked-prefill-size 16384 \
  --max-total-tokens 32768 \
  --mem-fraction-static 0.8 \
  --log-level debug
```

On the other prefill node (this example uses 2 prefill nodes, matching `--nnodes 2`), run the same command but change `--node-rank` to 1.

5. Run the decode worker on the head decode node

```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/decode_worker.py \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --skip-tokenizer-init \
  --trust-remote-code \
  --disaggregation-mode decode \
  --dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
  --disaggregation-bootstrap-port 30001 \
  --nnodes 12 \
  --node-rank 0 \
  --tp-size 48 \
  --dp-size 48 \
  --enable-dp-attention \
  --host 0.0.0.0 \
  --decode-log-interval 1 \
  --max-running-requests 36864 \
  --context-length 2716 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --deepep-mode low_latency \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --cuda-graph-bs 768 \
  --disable-shared-experts-fusion \
  --ep-num-redundant-experts 32 \
  --ep-dispatch-algorithm static \
  --eplb-algorithm deepseek \
  --attention-backend cutlass_mla \
  --watchdog-timeout 1000000 \
  --chunked-prefill-size 36864 \
  --mem-fraction-static 0.82 \
  --log-level debug
```

On the other decode nodes (this example has 12 total decode nodes), run the same command but change `--node-rank` to 1 through 11.
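Once the frontend and all workers are up, you can sanity-check the deployment. The snippet below is a sketch, not part of the official repro: it assumes the frontend from step 4 exposes the OpenAI-compatible `/v1/chat/completions` route on port 8000, and that `sgl_http_server.py` serves its `flush_cache` endpoint on port 9001 (the port used in the H100 guide's benchmarking section).

```bash
# send a short chat completion through the Dynamo frontend
curl -s http://${HEAD_PREFILL_NODE_IP}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'

# flush the KV cache on all workers between benchmark runs
curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
```

diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/components/backends/sglang/docs/dsr1-wideep-h100.md
index a23a3ada13..326650de74 100644
---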
a/components/backends/sglang/docs/dsr1-wideep-h100.md +++ b/components/backends/sglang/docs/dsr1-wideep-h100.md @@ -57,7 +57,7 @@ In each container, you should be in the `/sgl-workspace/dynamo/components/backen ```bash # run ingress -dynamo run in=http out=dyn & +python3 -m dynamo.frontend --http-port=8000 & # optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below) python3 utils/sgl_http_server.py --ns dynamo & # run prefill worker @@ -93,7 +93,7 @@ python3 -m dynamo.sglang.worker \ On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3 -7. Run the decode worker on the head decode node +6. Run the decode worker on the head decode node ```bash python3 -m dynamo.sglang.decode_worker \ @@ -131,6 +131,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands: prefill: + ```bash ... --max-running-requests 8192 \ @@ -142,6 +143,7 @@ prefill: ``` decode: + ```bash ... --max-running-requests 18432 \ @@ -152,9 +154,10 @@ decode: We currently provide 2 different ways to perform an end to end benchmark which includes using our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single batch workloads in the future. 1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL** -We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used. + We've found that 8k ISL 256 OSL provides a good baseline for measuring end to end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up. We provide a short ramping warmup script that can be used. Example usage: + ```bash # warmup ./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup @@ -165,9 +168,10 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache ``` 2. **GenAI Perf to benchmark completions with custom dataset** -We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAIPerf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL. + We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. 
We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup, but note that you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.

 Example usage:
+
 ```bash
 # generate data
 python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
diff --git a/container/Dockerfile.sglang-wideep b/container/Dockerfile.sglang-wideep
index e6aa11092f..dfcc0090ac 100644
--- a/container/Dockerfile.sglang-wideep
+++ b/container/Dockerfile.sglang-wideep
@@ -13,160 +13,131 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-# This should be pinned to the sglang version that is installed with Dynamo
-# in the pyproject.toml
-FROM lmsysorg/sglang:v0.4.8.post1-cu126
+FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
+
+ARG MODE="hopper"
+ARG SGLANG_IMAGE_TAG="v0.4.8.post1-cu126"
+ARG ARCH="amd64"
+ARG ARCH_ALT="x86_64"
+ARG NIXL_UCX_REF="v1.19.x"
+ARG NIXL_TAG="0.4.1"
+ARG CMAKE_VERSION="3.31.8"
+ARG RUST_VERSION="1.87.0"
+ARG CARGO_BUILD_JOBS="16"

-# Add NIXL build dependencies
 RUN apt-get update -y && \
     apt-get install -y \
-    cmake \
-    meson \
-    ninja-build \
-    pybind11-dev \
-    patchelf \
-    net-tools
-
-# Install Python build dependencies
-RUN pip install --break-system-packages meson-python wheel build
-
-# Add architecture args for NIXL build
-ARG ARCH=amd64
-ARG ARCH_ALT=x86_64
-
-WORKDIR /sgl-workspace
-
-# Install UCX dependencies
-RUN apt-get update -y && \
-    apt-get install -y --no-install-recommends \
-    --reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
-    libnuma-dev librdmacm-dev ibverbs-providers \
-    autoconf libtool
-
-# Build UCX from source
-ARG NIXL_UCX_REF=v1.19.x
-RUN rm -rf /opt/hpcx/ucx && \
-    rm -rf /usr/local/ucx && \
-    cd /usr/local/src && \
-    git clone https://github.com/openucx/ucx.git && \
-    cd ucx && \
-    git checkout $NIXL_UCX_REF && \
-    ./autogen.sh && ./configure \
-    --prefix=/usr/local/ucx \
-    --enable-shared \
-    --disable-static \
-    --disable-doxygen-doc \
-    --enable-optimizations \
-    --enable-cma \
-    --enable-devel-headers \
-    --with-cuda=/usr/local/cuda \
-    --with-verbs \
-    --with-efa \
-    --with-dm \
-    --with-gdrcopy=/usr/local \
-    --enable-mt && \
-    make -j && \
-    make -j install-strip && \
-    ldconfig
+    cmake meson ninja-build pybind11-dev patchelf net-tools \
+    build-essential protobuf-compiler libssl-dev pkg-config \
+    clang libclang-dev git rapidjson-dev zlib1g-dev && \
+    pip install --break-system-packages meson-python wheel build
+
+# Build UCX + NIXL for x86/hopper until it's fully tested on GB200
+RUN if [ "$MODE" = "hopper" ]; then \
+    apt-get install -y --no-install-recommends \
+    libibverbs-dev rdma-core ibverbs-utils libibumad-dev \
+    libnuma-dev librdmacm-dev ibverbs-providers autoconf libtool && \
+    # UCX from source
+    rm -rf /opt/hpcx/ucx /usr/local/ucx && \
+    cd /usr/local/src && \
+    git clone https://github.com/openucx/ucx.git && \
+    cd ucx && git checkout $NIXL_UCX_REF && \
+    ./autogen.sh && \
+    ./configure \
+    --prefix=/usr/local/ucx \
+    --enable-shared \
+    --disable-static \
+    --disable-doxygen-doc \
+    --enable-optimizations \
+    --enable-cma \
+    --enable-devel-headers \
+    --with-cuda=/usr/local/cuda \
+    --with-verbs \
+    --with-efa \
+    --with-dm \
+
--with-gdrcopy=/usr/local \ + --enable-mt && \ + make -j && make install-strip && ldconfig && \ + # NIXL + git clone https://github.com/ai-dynamo/nixl.git /opt/nixl && \ + cd /opt/nixl && git checkout $NIXL_TAG && \ + pip install --break-system-packages . \ + --config-settings="setup-args=-Ducx_path=/usr/local/ucx"; \ + fi ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH -ARG NIXL_TAG=0.3.1 -RUN git clone https://github.com/ai-dynamo/nixl.git && cd nixl && git checkout ${NIXL_TAG} && pip install --break-system-packages . --config-settings=setup-args="-Ducx_path=/usr/local/ucx" - -WORKDIR /sgl-workspace - -# Allow forceful shutdown of inflight requests -ENV SGL_FORCE_SHUTDOWN=1 - +# Dynamo WORKDIR /sgl-workspace RUN git clone https://github.com/ai-dynamo/dynamo.git -# install dynamo in editable mode -WORKDIR /sgl-workspace/dynamo -# Rust build/dev dependencies -RUN apt update -y && \ - apt install --no-install-recommends -y \ - build-essential \ - protobuf-compiler \ - cmake \ - libssl-dev \ - pkg-config \ - clang \ - libclang-dev \ - git - -# Define Rust target based on ARCH_ALT ARG -ARG RUSTARCH=${ARCH_ALT}-unknown-linux-gnu - ENV RUSTUP_HOME=/usr/local/rustup \ CARGO_HOME=/usr/local/cargo \ - PATH=/usr/local/cargo/bin:$PATH \ - RUST_VERSION=1.86.0 + PATH=/usr/local/cargo/bin:$PATH -# Install Rust using RUSTARCH derived from ARCH_ALT -RUN wget --tries=3 --waitretry=5 "https://static.rust-lang.org/rustup/archive/1.28.1/${RUSTARCH}/rustup-init" && \ - # TODO: Add SHA check back based on RUSTARCH +RUN wget --tries=3 --waitretry=5 \ + "https://static.rust-lang.org/rustup/archive/1.28.1/${ARCH_ALT}-unknown-linux-gnu/rustup-init" && \ chmod +x rustup-init && \ - ./rustup-init -y --no-modify-path --profile minimal --default-toolchain $RUST_VERSION --default-host ${RUSTARCH} && \ + ./rustup-init -y \ + --no-modify-path \ + --profile minimal \ + --default-toolchain $RUST_VERSION \ + --default-host ${ARCH_ALT}-unknown-linux-gnu && \ rm rustup-init && \ chmod -R a+w $RUSTUP_HOME $CARGO_HOME ARG CARGO_BUILD_JOBS -# Set CARGO_BUILD_JOBS to 16 if not provided -# This is to prevent cargo from building $(nproc) jobs in parallel, -# which might exceed the number of opened files limit. -ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS:-16} +ENV CARGO_BUILD_JOBS=${CARGO_BUILD_JOBS} + +RUN cd dynamo && cargo build --release -RUN cargo build --release +RUN cd dynamo/lib/bindings/python && \ + pip install --break-system-packages -e . && \ + cd /sgl-workspace/dynamo && \ + pip install --break-system-packages . -RUN cd lib/bindings/python && pip install --break-system-packages -e . && cd ../../.. -RUN pip install --break-system-packages . 
+RUN pip install --break-system-packages sglang-router==0.1.5 -RUN wget --tries=3 --waitretry=5 https://github.com/nats-io/nats-server/releases/download/v2.10.28/nats-server-v2.10.28-${ARCH}.deb && \ +RUN wget --tries=3 --waitretry=5 \ + https://github.com/nats-io/nats-server/releases/download/v2.10.28/\ +nats-server-v2.10.28-${ARCH}.deb && \ dpkg -i nats-server-v2.10.28-${ARCH}.deb && rm nats-server-v2.10.28-${ARCH}.deb ENV ETCD_VERSION="v3.5.21" -RUN wget --tries=3 --waitretry=5 https://github.com/etcd-io/etcd/releases/download/$ETCD_VERSION/etcd-$ETCD_VERSION-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \ +RUN wget --tries=3 --waitretry=5 \ + https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/\ +etcd-${ETCD_VERSION}-linux-${ARCH}.tar.gz -O /tmp/etcd.tar.gz && \ mkdir -p /usr/local/bin/etcd && \ - tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/etcd --strip-components=1 && \ + tar -xzf /tmp/etcd.tar.gz \ + -C /usr/local/bin/etcd --strip-components=1 && \ rm /tmp/etcd.tar.gz -ENV PATH=/usr/local/bin/etcd/:$PATH -ARG CMAKE_VERSION=3.31.8 -RUN mkdir /sgl-workspace/cmake_build -WORKDIR /sgl-workspace/cmake_build +ENV PATH=/usr/local/bin/etcd:$PATH -# uninstall CMake +# GenAI Perf RUN apt-get purge -y cmake -# download newer version of CMake -RUN wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ - tar -xvzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ - mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake -ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH -# should be 3.31.8 +RUN mkdir /sgl-workspace/cmake_build && \ + cd /sgl-workspace/cmake_build && \ + wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/\ +cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ + tar -xzf cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz && \ + mv cmake-${CMAKE_VERSION}-linux-$(uname -m) custom_cmake && \ + rm cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz + +ENV PATH=/sgl-workspace/cmake_build/custom_cmake/bin:$PATH RUN cmake --version -# Install perf_analyzer and genai-perf -RUN apt-get update -y && \ - apt-get install -y --no-install-recommends \ - rapidjson-dev \ - # jq and curl for polling various endpoints and health checks - jq \ - curl \ - zlib1g-dev - -RUN git clone --depth=1 https://github.com/triton-inference-server/perf_analyzer.git && \ +RUN git clone --depth=1 \ + https://github.com/triton-inference-server/perf_analyzer.git && \ mkdir perf_analyzer/build && \ cmake -B perf_analyzer/build -S perf_analyzer && \ - cmake --build perf_analyzer/build -- -j8 + cmake --build perf_analyzer/build -- -j$(nproc) ENV PATH=/sgl-workspace/perf_analyzer/build/perf_analyzer/src/perf-analyzer-build:$PATH - RUN pip install --break-system-packages genai-perf -# https://pypi.org/project/sglang-router/0.1.5 is latest -RUN pip install sglang-router==0.1.5 +# Enable forceful shutdown of inflight requests +ENV SGL_FORCE_SHUTDOWN=1 WORKDIR /sgl-workspace/dynamo/components/backends/sglang From 9e0baddf2182b7e31df8fe5aa7f63294472de24c Mon Sep 17 00:00:00 2001 From: ishandhanani Date: Thu, 31 Jul 2025 12:16:34 -0700 Subject: [PATCH 2/9] docs(slurm_jobs): add support for GB200 GPUs and SGLang commands --- .../backends/sglang/slurm_jobs/README.md | 51 +++- .../sglang/slurm_jobs/job_script_template.j2 | 34 ++- .../sglang/slurm_jobs/scripts/gb200.sh | 272 ++++++++++++++++++ .../sglang/slurm_jobs/scripts/h100.sh | 189 ++++++++++++ .../sglang/slurm_jobs/scripts/worker_setup.py | 231 +++++++++------ 
.../sglang/slurm_jobs/submit_job_script.py    |  13 +-
 6 files changed, 678 insertions(+), 112 deletions(-)
 create mode 100644 components/backends/sglang/slurm_jobs/scripts/gb200.sh
 create mode 100644 components/backends/sglang/slurm_jobs/scripts/h100.sh

diff --git a/components/backends/sglang/slurm_jobs/README.md b/components/backends/sglang/slurm_jobs/README.md
index 19f7c27ada..fc6b437ff6 100644
--- a/components/backends/sglang/slurm_jobs/README.md
+++ b/components/backends/sglang/slurm_jobs/README.md
@@ -45,6 +45,7 @@ logs/
 ## Setup

 For simplicity of the example, we will make some assumptions about your SLURM cluster:
+
 1. We assume you have access to a SLURM cluster with multiple GPU nodes
    available. For functional testing, most setups should be fine. For performance
    testing, you should aim to allocate groups of nodes that are performantly
@@ -61,7 +62,11 @@ For simplicity of the example, we will make some assumptions about your SLURM cl

 ## Usage

+> [!NOTE]
+> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.
+
 1. **Submit a benchmark job**:
+
    ```bash
    python submit_job_script.py \
      --template job_script_template.j2 \
      --model-dir /path/to/model \
      --config-dir /path/to/configs \
      --container-image container-image-uri \
      --account your-slurm-account
    ```

   **Required arguments**:
+
   - `--template`: Path to Jinja2 template file
   - `--model-dir`: Model directory path
   - `--config-dir`: Config directory path
   - `--container-image`: Container image URI
   - `--account`: SLURM account

   **Optional arguments**:
+
   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
   - `--decode-nodes`: Number of decode nodes (default: `2`)
   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
   - `--network-interface`: Network interface to use (default: `eth3`)
   - `--job-name`: SLURM job name (default: `dynamo_setup`)
   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
+  - `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
+  - `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)

   **Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.

-2. **Monitor job progress**:
+2. **Example with different GPU types**:
+
+   ```bash
+   # For H100 with Dynamo (default)
+   python submit_job_script.py \
+     --template job_script_template.j2 \
+     --model-dir /path/to/model \
+     --config-dir /path/to/configs \
+     --container-image container-image-uri \
+     --account your-slurm-account \
+     --gpu-type h100
+
+   # For GB200 with SGLang
+   python submit_job_script.py \
+     --template job_script_template.j2 \
+     --model-dir /path/to/model \
+     --config-dir /path/to/configs \
+     --container-image container-image-uri \
+     --account your-slurm-account \
+     --gpu-type gb200 \
+     --use-sglang-commands \
+     --gpus-per-node 4
+   ```
+
+3. **Monitor job progress**:
+
    ```bash
    squeue -u $USER
    ```

-3. **Check logs in real-time**:
+4. **Check logs in real-time**:
+
    ```bash
    tail -f logs/{JOB_ID}/log.out
    ```
-4. 
**Monitor GPU utilization**: + You can view logs of all prefill or decode workers simultaneously by running: + + ```bash + # prefill workers err (or .out) + tail -f logs/{JOB_ID}/*_prefill.err + + # decode workers err (or .out) + tail -f logs/{JOB_ID}/*_decode.err + ``` + +5. **Monitor GPU utilization**: ```bash tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log ``` diff --git a/components/backends/sglang/slurm_jobs/job_script_template.j2 b/components/backends/sglang/slurm_jobs/job_script_template.j2 index 84e0e33396..a0959fbb91 100755 --- a/components/backends/sglang/slurm_jobs/job_script_template.j2 +++ b/components/backends/sglang/slurm_jobs/job_script_template.j2 @@ -7,6 +7,7 @@ #SBATCH --time={{ time_limit }} #SBATCH --output=logs/%j/log.out #SBATCH --error=logs/%j/log.err +#SBATCH --partition=36x2-a01r # Constants PREFILL_NODES={{ prefill_nodes }} @@ -20,6 +21,8 @@ MODEL_DIR="{{ model_dir }}" CONFIG_DIR="{{ config_dir }}" CONTAINER_IMAGE="{{ container_image }}" NETWORK_INTERFACE="{{ network_interface }}" +GPU_TYPE="{{ gpu_type | default('h100') }}" +USE_SGLANG_COMMANDS="{{ use_sglang_commands | default(false) }}" {% raw %} @@ -36,14 +39,14 @@ for i in "${!nodes[@]}"; do echo "Node $i: ${nodes[$i]}" done -PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+') +PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ip route get $(getent ahosts ${nodes[0]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}') if [ -z "$PREFILL_HOST_IP" ]; then echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE" exit 1 fi echo "Prefill host IP address: $PREFILL_HOST_IP" -DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+') +DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ip route get $(getent ahosts ${nodes[$PREFILL_NODES]} | grep STREAM | head -1 | awk '{print $1}') | awk '{for(i=1;i<=NF;i++) if($i=="src") print $(i+1)}') if [ -z "$DECODE_HOST_IP" ]; then echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE" exit 1 @@ -54,21 +57,25 @@ echo "Decode host IP address: $DECODE_HOST_IP" ENROOT_ARGS="\ --container-image=${CONTAINER_IMAGE} \ --no-container-entrypoint \ - --container-mount-home \ - --no-container-remap-root \ + --no-container-mount-home \ --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \ " +# Build common worker arguments +WORKER_ARGS="--gpu_type ${GPU_TYPE} --gpus_per_node ${GPUS_PER_NODE}" +if [ "$USE_SGLANG_COMMANDS" = "True" ]; then + WORKER_ARGS="${WORKER_ARGS} --use-sglang-commands" +fi + # Launch prefill tasks on the first PREFILL_NODES nodes for i in $(seq 0 $((PREFILL_NODES - 1))); do node=${nodes[$i]} rank=$i echo "Launching prefill task on node ${i} (rank ${rank}): $node" - echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err" - echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &" - srun $ENROOT_ARGS --nodes=1 --ntasks=1 
--nodelist=$node \ - --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \ - python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log & + + cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log ${WORKER_ARGS}" + echo "$cmd" + $cmd & done # Launch decode tasks on the next DECODE_NODES nodes @@ -76,11 +83,10 @@ for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do node=${nodes[$i]} rank=$((i - PREFILL_NODES)) echo "Launching decode task on node ${i} (rank ${rank}): $node" - echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err" - echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &" - srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \ - --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \ - python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log & + + cmd="srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log ${WORKER_ARGS}" + echo "$cmd" + $cmd & done echo "" diff --git a/components/backends/sglang/slurm_jobs/scripts/gb200.sh b/components/backends/sglang/slurm_jobs/scripts/gb200.sh new file mode 100644 index 0000000000..cbc8cbce88 --- /dev/null +++ b/components/backends/sglang/slurm_jobs/scripts/gb200.sh @@ -0,0 +1,272 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0
+
+# Function to print usage
+print_usage() {
+    echo "Usage: $0 <mode> <cmd>"
+    echo "  mode: prefill or decode"
+    echo "  cmd: dynamo or sglang"
+    echo ""
+    echo "Examples:"
+    echo "  $0 prefill dynamo"
+    echo "  $0 decode sglang"
+    exit 1
+}
+
+# Check if correct number of arguments provided
+if [ $# -ne 2 ]; then
+    echo "Error: Expected 2 arguments, got $#"
+    print_usage
+fi
+
+# Parse arguments
+mode=$1
+cmd=$2
+
+# Validate mode argument
+if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
+    echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
+    print_usage
+fi
+
+# Validate cmd argument
+if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
+    echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
+    print_usage
+fi
+
+echo "Mode: $mode"
+echo "Command: $cmd"
+
+
+# Check if required environment variables are set
+if [ -z "$HOST_IP" ]; then
+    echo "Error: HOST_IP environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$PORT" ]; then
+    echo "Error: PORT environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_GPUS" ]; then
+    echo "Error: TOTAL_GPUS environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$RANK" ]; then
+    echo "Error: RANK environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_NODES" ]; then
+    echo "Error: TOTAL_NODES environment variable is not set"
+    exit 1
+fi
+
+# TODO: since the args for sglang and dynamo are the same, we can be a bit cleaner here
+
+# Construct command based on mode and cmd
+if [ "$mode" = "prefill" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # We are not using a init-expert-location file for e2e benchmarking
+        # We also don't currently have a --deepep-config file for GB200
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+        SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
+        MC_TE_METRIC=true \
+        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
+        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
+        SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
+        SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        MC_FORCE_MNNVL=1 \
+        NCCL_MNNVL_ENABLE=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 components/worker.py \
+        --served-model-name deepseek-ai/DeepSeek-R1 \
+        --model-path /model/ \
+        --skip-tokenizer-init \
+        --trust-remote-code \
+        --disaggregation-mode prefill \
+        --dist-init-addr "$HOST_IP:$PORT" \
+        --disaggregation-bootstrap-port 30001 \
+        --disaggregation-transfer-backend nixl \
+        --nnodes "$TOTAL_NODES" \
+        --node-rank "$RANK" \
+        --tp-size "$TOTAL_GPUS" \
+        --dp-size "$TOTAL_GPUS" \
+        --enable-dp-attention \
+        --host 0.0.0.0 \
+        --decode-log-interval 1 \
+        --max-running-requests 6144 \
+        --context-length 2716 \
+        --disable-radix-cache \
+        --enable-deepep-moe \
+        --deepep-mode low_latency \
+        --moe-dense-tp-size 1 \
+        --enable-dp-lm-head \
+        --disable-shared-experts-fusion \
+        --ep-num-redundant-experts 32 \
+        --ep-dispatch-algorithm static \
+        --eplb-algorithm deepseek \
+        --attention-backend cutlass_mla \
+        --watchdog-timeout 1000000 \
+        --disable-cuda-graph \
+        --chunked-prefill-size 16384 \
+        --max-total-tokens 32768 \
+        --mem-fraction-static 0.8 \
+        --log-level debug
+
+    elif [ "$cmd" = "sglang" ]; then
+        # GB200 sglang prefill command
+        # We are not using a init-expert-location file for e2e benchmarking
+        # We also don't currently have a --deepep-config file for GB200
+        # Need to increase --context-length to 10k for 8k1k benchmarking
+
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \ + MC_TE_METRIC=true \ + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \ + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \ + SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \ + SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \ + NCCL_MNNVL_ENABLE=1 \ + MC_FORCE_MNNVL=1 \ + NCCL_CUMEM_ENABLE=1 \ + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \ + SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \ + PYTHONUNBUFFERED=1 \ + python3 -m sglang.launch_server \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --model-path /model/ \ + --trust-remote-code \ + --disaggregation-mode prefill \ + --dist-init-addr "$HOST_IP:$PORT" \ + --disaggregation-bootstrap-port 30001 \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --decode-log-interval 1 \ + --max-running-requests 6144 \ + --context-length 2716 \ + --disable-radix-cache \ + --enable-deepep-moe \ + --deepep-mode low_latency \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-shared-experts-fusion \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm static \ + --eplb-algorithm deepseek \ + --attention-backend cutlass_mla \ + --watchdog-timeout 1000000 \ + --disable-cuda-graph \ + --chunked-prefill-size 16384 \ + --max-total-tokens 32768 \ + --mem-fraction-static 0.8 \ + --log-level debug + fi +elif [ "$mode" = "decode" ]; then + if [ "$cmd" = "dynamo" ]; then + # Need to increase --context-length to 10k for 8k1k benchmarking + # We are not using a init-expert-location file for e2e benchmarking + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \ + MC_TE_METRIC=true \ + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \ + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \ + SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \ + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \ + SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \ + NCCL_MNNVL_ENABLE=1 \ + MC_FORCE_MNNVL=1 \ + NCCL_CUMEM_ENABLE=1 \ + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \ + SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \ + PYTHONUNBUFFERED=1 \ + python3 components/decode_worker.py \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --model-path /model/ \ + --skip-tokenizer-init \ + --trust-remote-code \ + --disaggregation-mode decode \ + --dist-init-addr "$HOST_IP:$PORT" \ + --disaggregation-bootstrap-port 30001 \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --host 0.0.0.0 \ + --decode-log-interval 1 \ + --max-running-requests 36864 \ + --context-length 2716 \ + --disable-radix-cache \ + --enable-deepep-moe \ + --deepep-mode low_latency \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --cuda-graph-bs 768 \ + --disable-shared-experts-fusion \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm static \ + --eplb-algorithm deepseek \ + --attention-backend cutlass_mla \ + --watchdog-timeout 1000000 \ + --chunked-prefill-size 36864 \ + --mem-fraction-static 0.82 \ + --log-level debug + + elif [ "$cmd" = "sglang" ]; then + # GB200 sglang decode command + # Need to increase --context-length to 10k for 8k1k benchmarking + # We are not using a init-expert-location file for e2e benchmarking + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \ + MC_TE_METRIC=true \ + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \ + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \ + SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \ + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \ + 
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
+        NCCL_MNNVL_ENABLE=1 \
+        MC_FORCE_MNNVL=1 \
+        NCCL_CUMEM_ENABLE=1 \
+        SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
+        SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
+        PYTHONUNBUFFERED=1 \
+        python3 -m sglang.launch_server \
+        --model-path /model/ \
+        --trust-remote-code \
+        --disaggregation-mode decode \
+        --dist-init-addr "$HOST_IP:$PORT" \
+        --disaggregation-bootstrap-port 30001 \
+        --nnodes "$TOTAL_NODES" \
+        --node-rank "$RANK" \
+        --tp-size "$TOTAL_GPUS" \
+        --dp-size "$TOTAL_GPUS" \
+        --enable-dp-attention \
+        --host 0.0.0.0 \
+        --decode-log-interval 1 \
+        --max-running-requests 36864 \
+        --context-length 2716 \
+        --disable-radix-cache \
+        --enable-deepep-moe \
+        --deepep-mode low_latency \
+        --moe-dense-tp-size 1 \
+        --enable-dp-lm-head \
+        --cuda-graph-bs 768 \
+        --disable-shared-experts-fusion \
+        --ep-num-redundant-experts 32 \
+        --ep-dispatch-algorithm static \
+        --eplb-algorithm deepseek \
+        --attention-backend cutlass_mla \
+        --watchdog-timeout 1000000 \
+        --chunked-prefill-size 36864 \
+        --mem-fraction-static 0.82 \
+        --log-level debug
+    fi
+fi
diff --git a/components/backends/sglang/slurm_jobs/scripts/h100.sh b/components/backends/sglang/slurm_jobs/scripts/h100.sh
new file mode 100644
index 0000000000..b457484e3a
--- /dev/null
+++ b/components/backends/sglang/slurm_jobs/scripts/h100.sh
@@ -0,0 +1,189 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Function to print usage
+print_usage() {
+    echo "Usage: $0 <mode> <cmd>"
+    echo "  mode: prefill or decode"
+    echo "  cmd: dynamo or sglang"
+    echo ""
+    echo "Examples:"
+    echo "  $0 prefill dynamo"
+    echo "  $0 decode sglang"
+    exit 1
+}
+
+# Check if correct number of arguments provided
+if [ $# -ne 2 ]; then
+    echo "Error: Expected 2 arguments, got $#"
+    print_usage
+fi
+
+# Parse arguments
+mode=$1
+cmd=$2
+
+# Validate mode argument
+if [ "$mode" != "prefill" ] && [ "$mode" != "decode" ]; then
+    echo "Error: mode must be 'prefill' or 'decode', got '$mode'"
+    print_usage
+fi
+
+# Validate cmd argument
+if [ "$cmd" != "dynamo" ] && [ "$cmd" != "sglang" ]; then
+    echo "Error: cmd must be 'dynamo' or 'sglang', got '$cmd'"
+    print_usage
+fi
+
+echo "Mode: $mode"
+echo "Command: $cmd"
+
+
+# Check if required environment variables are set
+if [ -z "$HOST_IP" ]; then
+    echo "Error: HOST_IP environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$PORT" ]; then
+    echo "Error: PORT environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_GPUS" ]; then
+    echo "Error: TOTAL_GPUS environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$RANK" ]; then
+    echo "Error: RANK environment variable is not set"
+    exit 1
+fi
+
+if [ -z "$TOTAL_NODES" ]; then
+    echo "Error: TOTAL_NODES environment variable is not set"
+    exit 1
+fi
+
+# Construct command based on mode and cmd
+if [ "$mode" = "prefill" ]; then
+    if [ "$cmd" = "dynamo" ]; then
+        # H100 dynamo prefill command
+        python3 components/worker.py \
+        --model-path /model/ \
+        --served-model-name deepseek-ai/DeepSeek-R1 \
+        --skip-tokenizer-init \
+        --disaggregation-mode prefill \
+        --disaggregation-transfer-backend nixl \
+        --disaggregation-bootstrap-port 30001 \
+        --dist-init-addr "$HOST_IP:$PORT" \
+        --nnodes "$TOTAL_NODES" \
+        --node-rank "$RANK" \
+        --tp-size "$TOTAL_GPUS" \
+        --dp-size "$TOTAL_GPUS" \
+        --enable-dp-attention \
+        --decode-log-interval 1 \
+        --enable-deepep-moe \
+        --page-size 1 \
+        --trust-remote-code \
+        --moe-dense-tp-size 1 \
+
--enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode normal \ + --mem-fraction-static 0.85 \ + --deepep-config /configs/deepep.json \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm dynamic \ + --eplb-algorithm deepseek + elif [ "$cmd" = "sglang" ]; then + # H100 sglang prefill command + python3 -m sglang.launch_server \ + --model-path /model/ \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode prefill \ + --dist-init-addr "$HOST_IP:$PORT" \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --decode-log-interval 1 \ + --enable-deepep-moe \ + --page-size 1 \ + --host 0.0.0.0 \ + --trust-remote-code \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode normal \ + --mem-fraction-static 0.85 \ + --ep-num-redundant-experts 32 \ + --ep-dispatch-algorithm dynamic \ + --eplb-algorithm deepseek \ + --deepep-config /configs/deepep.json + fi +elif [ "$mode" = "decode" ]; then + if [ "$cmd" = "dynamo" ]; then + # H100 dynamo decode command + python3 components/decode_worker.py \ + --model-path /model/ \ + --served-model-name deepseek-ai/DeepSeek-R1 \ + --skip-tokenizer-init \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend nixl \ + --disaggregation-bootstrap-port 30001 \ + --dist-init-addr "$HOST_IP:$PORT" \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --decode-log-interval 1 \ + --enable-deepep-moe \ + --page-size 1 \ + --trust-remote-code \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode low_latency \ + --mem-fraction-static 0.835 \ + --ep-num-redundant-experts 32 \ + --cuda-graph-bs 256 + elif [ "$cmd" = "sglang" ]; then + # H100 sglang decode command + python3 -m sglang.launch_server \ + --model-path /model/ \ + --disaggregation-transfer-backend nixl \ + --disaggregation-mode decode \ + --dist-init-addr "$HOST_IP:$PORT" \ + --nnodes "$TOTAL_NODES" \ + --node-rank "$RANK" \ + --tp-size "$TOTAL_GPUS" \ + --dp-size "$TOTAL_GPUS" \ + --enable-dp-attention \ + --decode-log-interval 1 \ + --enable-deepep-moe \ + --page-size 1 \ + --host 0.0.0.0 \ + --trust-remote-code \ + --moe-dense-tp-size 1 \ + --enable-dp-lm-head \ + --disable-radix-cache \ + --watchdog-timeout 1000000 \ + --enable-two-batch-overlap \ + --deepep-mode low_latency \ + --mem-fraction-static 0.835 \ + --ep-num-redundant-experts 32 \ + --cuda-graph-bs 256 + fi +fi + + diff --git a/components/backends/sglang/slurm_jobs/scripts/worker_setup.py b/components/backends/sglang/slurm_jobs/scripts/worker_setup.py index db6ac88531..cfe2aaa634 100644 --- a/components/backends/sglang/slurm_jobs/scripts/worker_setup.py +++ b/components/backends/sglang/slurm_jobs/scripts/worker_setup.py @@ -8,8 +8,8 @@ The script will: - Setup the environment -- Update the YAML config file -- Start Dynamo graphs.disagg service +- Generate the python3 command to run the prefill or decode worker +- Start dynamo (or sglang) - Monitor the GPU utilization """ @@ -165,6 +165,19 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac default=None, help="File to log GPU utilization (default: None)", ) + 
parser.add_argument( + "--use-sglang-commands", + action="store_true", + default=False, + help="Helper to spin up SGLang servers instead of dynamo. This is helpful for benchmarking SGLang as well", + ) + parser.add_argument( + "--gpu_type", + type=str, + choices=["h100", "gb200"], + default="h100", + help="Type of GPU to use", + ) return parser.parse_args(args) @@ -181,73 +194,114 @@ def _validate_args(args: argparse.Namespace) -> None: raise ValueError("GPUs per node must be at least 1") -def setup_prefill_node( - rank: int, prefill_host_ip: str, total_nodes: int, total_gpus: int -) -> int: +def get_sglang_mini_lb_command_args(prefill_host_ip: str, decode_host_ip: str) -> str: + cmd = ( + f"python3 -m sglang.srt.disaggregation.launch_lb " + f"--prefill http://{prefill_host_ip}:30000 " + f"--decode http://{decode_host_ip}:30000 " + "--host 0.0.0.0 " + "--port 8000 " + "--timeout 3600" + ) + return cmd + + +def setup_env_vars_for_gpu_script( + host_ip: str, + rank: int, + total_gpus: int, + total_nodes: int, + port: int = DIST_INIT_PORT, +): + """Setup environment variables required by GPU scripts (h100.sh, gb200.sh)""" + os.environ["HOST_IP"] = host_ip + os.environ["PORT"] = str(port) + os.environ["TOTAL_GPUS"] = str(total_gpus) + os.environ["RANK"] = str(rank) + os.environ["TOTAL_NODES"] = str(total_nodes) + + logging.info(f"Set HOST_IP: {host_ip}") + logging.info(f"Set PORT: {port}") + logging.info(f"Set TOTAL_GPUS: {total_gpus}") + logging.info(f"Set RANK: {rank}") + logging.info(f"Set TOTAL_NODES: {total_nodes}") + + +def get_gpu_command(worker_type: str, use_sglang_commands: bool, gpu_type: str) -> str: + """Generate command to run the appropriate GPU script""" + script_name = f"{gpu_type}.sh" + script_path = Path(__file__).parent / script_name + mode = worker_type # "prefill" or "decode" + cmd = "sglang" if use_sglang_commands else "dynamo" + + return f"bash {script_path} {mode} {cmd}" + + +def setup_head_prefill_node(prefill_host_ip: str) -> None: """ - Setup the prefill node. + Setup NATS, etcd, ingress, and http servers on the prefill host node. 
""" - if rank == 0: - logging.info(f"Setting up host prefill node: {rank}") - logging.info(f"Starting nats server on node {rank} with IP {prefill_host_ip}") - - nats_process = run_command("nats-server -js", background=True) - if not nats_process: - raise RuntimeError("Failed to start nats-server") - - etcd_cmd = ( - f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " - f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " - f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} " - f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}" - ) + logging.info(f"Starting nats server on node {prefill_host_ip}") + + nats_process = run_command("nats-server -js", background=True) + if not nats_process: + raise RuntimeError("Failed to start nats-server") + + logging.info(f"Starting etcd server on node {prefill_host_ip}") + etcd_cmd = ( + f"etcd --listen-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " + f"--advertise-client-urls {ETCD_LISTEN_ADDR}:{ETCD_CLIENT_PORT} " + f"--listen-peer-urls {ETCD_LISTEN_ADDR}:{ETCD_PEER_PORT} " + f"--initial-cluster default=http://{prefill_host_ip}:{ETCD_PEER_PORT}" + ) - etcd_process = run_command(etcd_cmd, background=True) - if not etcd_process: - raise RuntimeError("Failed to start etcd") + etcd_process = run_command(etcd_cmd, background=True) + if not etcd_process: + raise RuntimeError("Failed to start etcd") - ingress_process = run_command("dynamo run in=http out=dyn", background=True) - if not ingress_process: - raise RuntimeError("Failed to start ingress") + logging.info(f"Starting ingress server on node {prefill_host_ip}") + ingress_process = run_command( + "dynamo run in=http out=dyn --http-port=8000", background=True + ) + if not ingress_process: + raise RuntimeError("Failed to start ingress") + + logging.info( + f"Starting http server on port 9001 for flush_cache endpoint on node {prefill_host_ip}" + ) + cache_flush_server_cmd = "python3 utils/sgl_http_server.py --ns dynamo" + cache_flush_server_process = run_command(cache_flush_server_cmd, background=True) + if not cache_flush_server_process: + raise RuntimeError("Failed to start cache flush server") + +def setup_prefill_node( + rank: int, + prefill_host_ip: str, + total_nodes: int, + total_gpus: int, + use_sglang_commands: bool, + gpu_type: str, +) -> int: + """ + Setup the prefill node. + """ + if not use_sglang_commands: + if rank == 0: + setup_head_prefill_node(prefill_host_ip) + else: + logging.info(f"Setting up child prefill node: {rank}") + if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): + raise RuntimeError("Failed to connect to etcd") else: - logging.info(f"Setting up child prefill node: {rank}") - if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): - raise RuntimeError("Failed to connect to etcd") + logging.info("Using SGLang servers. No need to setup etcd or nats") - # NOTE: This implements the example in examples/sglang/dsr1-wideep.md - # For other examples, the command might have to be modified. 
- dynamo_cmd = ( - f"python3 -m dynamo.sglang.worker " - "--model-path /model/ " - "--served-model-name deepseek-ai/DeepSeek-R1 " - "--skip-tokenizer-init " - "--disaggregation-mode prefill " - "--disaggregation-transfer-backend nixl " - "--disaggregation-bootstrap-port 30001 " - f"--dist-init-addr {prefill_host_ip}:{DIST_INIT_PORT} " - f"--nnodes {total_nodes} " - f"--node-rank {rank} " - f"--tp-size {total_gpus} " - f"--dp-size {total_gpus} " - "--enable-dp-attention " - "--decode-log-interval 1 " - "--enable-deepep-moe " - "--page-size 1 " - "--trust-remote-code " - "--moe-dense-tp-size 1 " - "--enable-dp-lm-head " - "--disable-radix-cache " - "--watchdog-timeout 1000000 " - "--enable-two-batch-overlap " - "--deepep-mode normal " - "--mem-fraction-static 0.85 " - "--deepep-config /configs/deepep.json " - "--ep-num-redundant-experts 32 " - "--ep-dispatch-algorithm dynamic " - "--eplb-algorithm deepseek " - ) - return run_command(dynamo_cmd) + # Setup environment variables for GPU script + setup_env_vars_for_gpu_script(prefill_host_ip, rank, total_gpus, total_nodes) + + # Use appropriate GPU script instead of generating command directly + cmd_to_run = get_gpu_command("prefill", use_sglang_commands, gpu_type) + return run_command(cmd_to_run) def setup_decode_node( @@ -256,45 +310,29 @@ def setup_decode_node( prefill_host_ip: str, total_nodes: int, total_gpus: int, + use_sglang_commands: bool, + gpu_type: str, ) -> int: """ Setup the decode node. """ logging.info(f"Setting up child decode node: {rank}") - if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): - raise RuntimeError("Failed to connect to etcd") - - dynamo_cmd = ( - "python3 -m dynamo.sglang.decode_worker " - "--model-path /model/ " - "--served-model-name deepseek-ai/DeepSeek-R1 " - "--skip-tokenizer-init " - "--disaggregation-mode decode " - "--disaggregation-transfer-backend nixl " - "--disaggregation-bootstrap-port 30001 " - f"--dist-init-addr {decode_host_ip}:{DIST_INIT_PORT} " - f"--nnodes {total_nodes} " - f"--node-rank {rank} " - f"--tp-size {total_gpus} " - f"--dp-size {total_gpus} " - "--enable-dp-attention " - "--decode-log-interval 1 " - "--enable-deepep-moe " - "--page-size 1 " - "--trust-remote-code " - "--moe-dense-tp-size 1 " - "--enable-dp-lm-head " - "--disable-radix-cache " - "--watchdog-timeout 1000000 " - "--enable-two-batch-overlap " - "--deepep-mode low_latency " - "--mem-fraction-static 0.835 " - "--ep-num-redundant-experts 32 " - "--cuda-graph-bs 256 " - ) + if use_sglang_commands: + sgl_mini_lb_cmd = get_sglang_mini_lb_command_args( + prefill_host_ip, decode_host_ip + ) + run_command(sgl_mini_lb_cmd, background=True) + else: + if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"): + raise RuntimeError("Failed to connect to etcd") + + # Setup environment variables for GPU script + setup_env_vars_for_gpu_script(decode_host_ip, rank, total_gpus, total_nodes) - return run_command(dynamo_cmd) + # Use appropriate GPU script instead of generating command directly + cmd_to_run = get_gpu_command("decode", use_sglang_commands, gpu_type) + return run_command(cmd_to_run) def setup_env(prefill_host_ip: str): @@ -321,6 +359,7 @@ def main(input_args: list[str] | None = None): logging.info(f"Prefill host IP: {args.prefill_host_ip}") logging.info(f"Decode host IP: {args.decode_host_ip}") logging.info(f"Rank: {args.rank}") + logging.info(f"Use SGLang commands: {args.use_sglang_commands}") setup_env(args.prefill_host_ip) if args.worker_type == "prefill": @@ -329,6 +368,8 @@ def 
main(input_args: list[str] | None = None): args.prefill_host_ip, args.total_nodes, args.total_nodes * args.gpus_per_node, + args.use_sglang_commands, + args.gpu_type, ) else: setup_decode_node( @@ -337,6 +378,8 @@ def main(input_args: list[str] | None = None): args.prefill_host_ip, args.total_nodes, args.total_nodes * args.gpus_per_node, + args.use_sglang_commands, + args.gpu_type, ) logging.info(f"{args.worker_type.capitalize()} node setup complete") diff --git a/components/backends/sglang/slurm_jobs/submit_job_script.py b/components/backends/sglang/slurm_jobs/submit_job_script.py index 64f492224e..196de92a0d 100644 --- a/components/backends/sglang/slurm_jobs/submit_job_script.py +++ b/components/backends/sglang/slurm_jobs/submit_job_script.py @@ -86,7 +86,7 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac parser.add_argument("--config-dir", required=True, help="Config directory path") parser.add_argument("--container-image", required=True, help="Container image") parser.add_argument( - "--time-limit", default="01:00:00", help="Time limit (HH:MM:SS)" + "--time-limit", default="04:00:00", help="Time limit (HH:MM:SS)" ) parser.add_argument( "--prefill-nodes", type=int, default=2, help="Number of prefill nodes" @@ -100,6 +100,15 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac parser.add_argument( "--network-interface", default="eth3", help="Network interface to use" ) + parser.add_argument( + "--gpu-type", choices=["h100", "gb200"], default="h100", help="GPU type to use" + ) + parser.add_argument( + "--use-sglang-commands", + action="store_true", + default=False, + help="Use SGLang commands instead of Dynamo", + ) return parser.parse_args(args) @@ -120,6 +129,8 @@ def main(input_args: list[str] | None = None): "container_image": args.container_image, "gpus_per_node": args.gpus_per_node, "network_interface": args.network_interface, + "gpu_type": args.gpu_type, + "use_sglang_commands": args.use_sglang_commands, } with tempfile.NamedTemporaryFile(mode="w", suffix=".sh") as temp_file: From 1801f8d54c46a9a06974189f6eb15697699bccce Mon Sep 17 00:00:00 2001 From: ishandhanani Date: Thu, 31 Jul 2025 12:20:51 -0700 Subject: [PATCH 3/9] chore(scripts): make sglang slurm scripts executable and clean up Dockerfile whitespace --- components/backends/sglang/slurm_jobs/scripts/gb200.sh | 0 components/backends/sglang/slurm_jobs/scripts/h100.sh | 0 container/Dockerfile.sglang-wideep | 6 +++--- 3 files changed, 3 insertions(+), 3 deletions(-) mode change 100644 => 100755 components/backends/sglang/slurm_jobs/scripts/gb200.sh mode change 100644 => 100755 components/backends/sglang/slurm_jobs/scripts/h100.sh diff --git a/components/backends/sglang/slurm_jobs/scripts/gb200.sh b/components/backends/sglang/slurm_jobs/scripts/gb200.sh old mode 100644 new mode 100755 diff --git a/components/backends/sglang/slurm_jobs/scripts/h100.sh b/components/backends/sglang/slurm_jobs/scripts/h100.sh old mode 100644 new mode 100755 diff --git a/container/Dockerfile.sglang-wideep b/container/Dockerfile.sglang-wideep index dfcc0090ac..47086029f1 100644 --- a/container/Dockerfile.sglang-wideep +++ b/container/Dockerfile.sglang-wideep @@ -15,8 +15,8 @@ FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG} -ARG MODE="hopper" -ARG SGLANG_IMAGE_TAG="v0.4.8.post1-cu126" +ARG MODE="hopper" +ARG SGLANG_IMAGE_TAG="v0.4.8.post1-cu126" ARG ARCH="amd64" ARG ARCH_ALT="x86_64" ARG NIXL_UCX_REF="v1.19.x" @@ -67,7 +67,7 @@ RUN if [ "$MODE" = "hopper" ]; then \ ENV 
LD_LIBRARY_PATH=/usr/lib:/usr/local/ucx/lib:$LD_LIBRARY_PATH -# Dynamo +# Dynamo WORKDIR /sgl-workspace RUN git clone https://github.com/ai-dynamo/dynamo.git From e91ba890f71c39e76a49becb2c59abb1199a85a6 Mon Sep 17 00:00:00 2001 From: ishandhanani Date: Thu, 31 Jul 2025 19:28:31 +0000 Subject: [PATCH 4/9] feat(slurm_jobs): add partition argument to job script submission and template --- .../backends/sglang/slurm_jobs/job_script_template.j2 | 2 +- components/backends/sglang/slurm_jobs/submit_job_script.py | 6 ++++++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/components/backends/sglang/slurm_jobs/job_script_template.j2 b/components/backends/sglang/slurm_jobs/job_script_template.j2 index a0959fbb91..0f9b500bd4 100755 --- a/components/backends/sglang/slurm_jobs/job_script_template.j2 +++ b/components/backends/sglang/slurm_jobs/job_script_template.j2 @@ -7,7 +7,7 @@ #SBATCH --time={{ time_limit }} #SBATCH --output=logs/%j/log.out #SBATCH --error=logs/%j/log.err -#SBATCH --partition=36x2-a01r +#SBATCH --partition={{ partition }} # Constants PREFILL_NODES={{ prefill_nodes }} diff --git a/components/backends/sglang/slurm_jobs/submit_job_script.py b/components/backends/sglang/slurm_jobs/submit_job_script.py index 196de92a0d..ee386929c8 100644 --- a/components/backends/sglang/slurm_jobs/submit_job_script.py +++ b/components/backends/sglang/slurm_jobs/submit_job_script.py @@ -109,6 +109,11 @@ def _parse_command_line_args(args: list[str] | None = None) -> argparse.Namespac default=False, help="Use SGLang commands instead of Dynamo", ) + parser.add_argument( + "--partition", + default="batch", + help="SLURM partition to use", + ) return parser.parse_args(args) @@ -131,6 +136,7 @@ def main(input_args: list[str] | None = None): "network_interface": args.network_interface, "gpu_type": args.gpu_type, "use_sglang_commands": args.use_sglang_commands, + "partition": args.partition, } with tempfile.NamedTemporaryFile(mode="w", suffix=".sh") as temp_file: From acd33abf94f1a5dae623dd91a08ec6f4c5c3cac7 Mon Sep 17 00:00:00 2001 From: ishandhanani Date: Fri, 1 Aug 2025 00:08:57 +0000 Subject: [PATCH 5/9] docs(sglang): update build instructions and Dockerfile image tag for wide expert parallelism setup --- .../backends/sglang/docs/dsr1-wideep-gb200.md | 3 +-- .../backends/sglang/docs/dsr1-wideep-h100.md | 20 +++++++------------ container/Dockerfile.sglang-wideep | 3 ++- 3 files changed, 10 insertions(+), 16 deletions(-) diff --git a/components/backends/sglang/docs/dsr1-wideep-gb200.md b/components/backends/sglang/docs/dsr1-wideep-gb200.md index ea987fae0f..3dc7d24905 100644 --- a/components/backends/sglang/docs/dsr1-wideep-gb200.md +++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md @@ -32,8 +32,7 @@ docker build \ --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \ --build-arg ARCH=arm64 \ --build-arg ARCH_ALT=aarch64 \ - . \ - --no-cache + . ``` 2. You can run this container on each 4xGB200 node using the following command. diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/components/backends/sglang/docs/dsr1-wideep-h100.md index 326650de74..092222022c 100644 --- a/components/backends/sglang/docs/dsr1-wideep-h100.md +++ b/components/backends/sglang/docs/dsr1-wideep-h100.md @@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca ## Instructions -1. Pull the SGLang release `v0.4.8.post1` container. We are actively working on validating newer releases. 
-
-```bash
-docker pull lmsysorg/sglang:v0.4.8.post1-cu126
-```
-
-You can also pull a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags)
-
-2. Build the Dynamo container
+1. Build the Dynamo container
 
 ```bash
 cd $DYNAMO_ROOT
 docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
 ```
 
-3. You can run this container on each 8xH100 node using the following command.
+You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.
+
+2. You can run this container on each 8xH100 node using the following command.
 
 > [!IMPORTANT]
 > We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
@@ -47,13 +41,13 @@ docker run \
 
 In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.
 
-4. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
+3. On the head prefill node, run the helper script provided to generate commands to start the `nats-server`, `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.
 
 ```bash
 ./utils/gen_env_vars.sh
 ```
 
-5. Run the ingress and prefill worker
+4. Run the ingress and prefill worker
 
 ```bash
 # run ingress
@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \
 
 On the other prefill node (since this example has 4 total prefill nodes), run the same command but change `--node-rank` to 1,2, and 3
 
-6. Run the decode worker on the head decode node
+5. Run the decode worker on the head decode node
 
 ```bash
 python3 -m dynamo.sglang.decode_worker \
diff --git a/container/Dockerfile.sglang-wideep b/container/Dockerfile.sglang-wideep
index 47086029f1..891cd20c5e 100644
--- a/container/Dockerfile.sglang-wideep
+++ b/container/Dockerfile.sglang-wideep
@@ -13,10 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+ARG SGLANG_IMAGE_TAG="v0.4.10-cu126"
+
 FROM lmsysorg/sglang:${SGLANG_IMAGE_TAG}
 
 ARG MODE="hopper"
-ARG SGLANG_IMAGE_TAG="v0.4.8.post1-cu126"
 ARG ARCH="amd64"
 ARG ARCH_ALT="x86_64"
 ARG NIXL_UCX_REF="v1.19.x"

From f199a6462a6a3cdc0cd1bf03d5578479d0b01418 Mon Sep 17 00:00:00 2001
From: ishandhanani
Date: Fri, 1 Aug 2025 00:26:39 +0000
Subject: [PATCH 6/9] docs(sglang): update README feature matrix and deployment links

---
 components/backends/sglang/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md
index a7471b5c32..daa0767e5d 100644
--- a/components/backends/sglang/README.md
+++ b/components/backends/sglang/README.md
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 ### Large Scale P/D and WideEP Features
 
-| Feature | SGLang | Notes |
-|--------------------|--------|-----------------------------------------------------------------------|
-| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
-| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
-| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
+| Feature | SGLang | Notes |
+|---------------------|--------|--------------------------------------------------------------|
+| **WideEP** | ✅ | Full support on H100s/GB200 |
+| **DP Rank Routing** | 🚧 | Direct routing supported. The Dynamo KV router does not route to individual DP workers |
+| **GB200 Support** | ✅ | |
 
 ## Quick Start
 
@@ -161,7 +161,7 @@ The migrated request will continue responding to the original request, allowing
 
 Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 
-### Run on multi-node
+### Run a multi-node model
 - **[Run a multi-node model](docs/multinode-examples.md)**
 
 ### Large scale P/D disaggregation with WideEP
@@ -173,10 +173,10 @@ Below we provide a selected list of advanced examples. Please open up an issue i
 
 ## Deployment
 
-We currently provide deployment examples for Kubernetes (coming soon!) and SLURM
+We currently provide deployment examples for Kubernetes and SLURM.
 
 ## Kubernetes
-- **[Deploying Dynamo with SGLang on Kubernetes - coming soon!](.)**
+- **[Deploying Dynamo with SGLang on Kubernetes](deploy/)**
 
 ## SLURM
 - **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**

From 21669c8b5cb8ee163443a6663608ebee60b318ac Mon Sep 17 00:00:00 2001
From: ishandhanani
Date: Fri, 1 Aug 2025 00:30:30 +0000
Subject: [PATCH 7/9] docs(dsr1-wideep-gb200): update build command syntax for
 clarity

---
 components/backends/sglang/docs/dsr1-wideep-gb200.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/components/backends/sglang/docs/dsr1-wideep-gb200.md b/components/backends/sglang/docs/dsr1-wideep-gb200.md
index 3dc7d24905..c01eaab079 100644
--- a/components/backends/sglang/docs/dsr1-wideep-gb200.md
+++ b/components/backends/sglang/docs/dsr1-wideep-gb200.md
@@ -32,7 +32,7 @@ docker build \
   --build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
   --build-arg ARCH=arm64 \
   --build-arg ARCH_ALT=aarch64 \
-  . 
+  .
 ```
 
 2. You can run this container on each 4xGB200 node using the following command.
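
A note on the SLURM changes in this series: the flags added to `submit_job_script.py` (`--partition`, `--gpu-type`, `--use-sglang-commands`) all reach the job script through a single `template_vars` dict rendered against `job_script_template.j2`. The sketch below is a minimal illustration of that render step under stated assumptions: the inline template is a hypothetical stand-in for the real `.j2` file, the values are the argparse defaults from these patches, and Jinja2 is assumed to be installed.

```python
# Minimal sketch of the submit_job_script.py render step.
# The inline template is a stand-in for job_script_template.j2;
# the variable names match the template_vars dict added in this series.
from jinja2 import Template

template = Template(
    "#SBATCH --partition={{ partition }}\n"
    "#SBATCH --time={{ time_limit }}\n"
    "GPU_TYPE={{ gpu_type }}\n"
    "USE_SGLANG_COMMANDS={{ use_sglang_commands }}\n"
)

template_vars = {
    "partition": "batch",          # --partition (default: "batch")
    "time_limit": "04:00:00",      # --time-limit (new default)
    "gpu_type": "gb200",           # --gpu-type (choices: h100, gb200)
    "use_sglang_commands": False,  # --use-sglang-commands (store_true)
}

# main() writes the rendered script to a tempfile.NamedTemporaryFile and
# submits it; printing is enough here to see the substitution.
print(template.render(**template_vars))
```

Rendering the partition through `{{ partition }}` is what allows the previously hard-coded `36x2-a01r` partition in the template to be replaced by the `--partition` flag.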
From a599079d01acbc7b67f8c3c4e7215cf6ed6b81d7 Mon Sep 17 00:00:00 2001
From: ishandhanani
Date: Fri, 1 Aug 2025 02:01:48 +0000
Subject: [PATCH 8/9] fix(docs): update cuda-graph-bs parameter in
 dsr1-wideep-h100.md

---
 components/backends/sglang/docs/dsr1-wideep-h100.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/components/backends/sglang/docs/dsr1-wideep-h100.md
index 092222022c..7cd694f8f0 100644
--- a/components/backends/sglang/docs/dsr1-wideep-h100.md
+++ b/components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -115,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \
   --deepep-mode low_latency \
   --mem-fraction-static 0.835 \
   --ep-num-redundant-experts 32 \
-  --cuda-graph-bs 256
+  --cuda-graph-bs 128
 ```
 
 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8

From 71b5bfba76f9ae89adfa77a42a5ea0565143f1d6 Mon Sep 17 00:00:00 2001
From: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Date: Fri, 1 Aug 2025 10:00:15 -0700
Subject: [PATCH 9/9] Update components/backends/sglang/docs/dsr1-wideep-h100.md

Co-authored-by: Ryan McCormick
Signed-off-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
---
 components/backends/sglang/docs/dsr1-wideep-h100.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/components/backends/sglang/docs/dsr1-wideep-h100.md
index 7cd694f8f0..fcd43d6e66 100644
--- a/components/backends/sglang/docs/dsr1-wideep-h100.md
+++ b/components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -162,7 +162,7 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
 ```
 
 2. **GenAI Perf to benchmark completions with custom dataset**
-   We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAIPerf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
+   We provide a script that generates a JSONL file of the ShareGPT dataset and then uses GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup, but note that you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at 4096 ISL and 5 OSL.
 
 Example usage: