feat: sglang + gb200 #2223
Walkthrough

This change introduces new documentation and scripts for deploying DeepSeek-R1 with SGLang WideEP on GB200 and H100 hardware, refactors SLURM job orchestration to support both SGLang and Dynamo commands, and parameterizes the Dockerfile for flexible builds. It adds GPU-specific shell scripts, updates worker setup logic, and enhances SLURM templates and documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant SLURM Controller
    participant Worker Setup Script
    participant GPU Shell Script (h100.sh/gb200.sh)
    participant SGLang/Dynamo Worker
    User->>SLURM Controller: Submit job with --gpu-type and --use-sglang-commands
    SLURM Controller->>Worker Setup Script: Launch worker_setup.py with args
    Worker Setup Script->>GPU Shell Script: Set env vars, call h100.sh/gb200.sh with mode/command
    GPU Shell Script->>SGLang/Dynamo Worker: Launch Python server with parameters
    SGLang/Dynamo Worker-->>User: Serve model requests
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs
Actionable comments posted: 10
🧹 Nitpick comments (5)
components/backends/sglang/docs/dsr1-wideep-h100.md (1)

134-174: Minor style suggestion for conciseness. The formatting improvements in the benchmarking section enhance readability. However, consider replacing "in order to" with "to" on line 171 for more concise writing:

```diff
-We use ShareGPT in order to leverage the pre-existing EPLB distributions
+We use ShareGPT to leverage the pre-existing EPLB distributions
```

components/backends/sglang/slurm_jobs/scripts/gb200.sh (1)
69-69: Consider refactoring to reduce code duplication. The TODO comment correctly identifies an opportunity for improvement. The dynamo and sglang commands share many identical parameters; consider extracting common arguments into variables.

Example approach:

```bash
# Common arguments
COMMON_ARGS="--served-model-name deepseek-ai/DeepSeek-R1 \
  --model-path /model/ \
  --trust-remote-code \
  --dist-init-addr $HOST_IP:$PORT \
  --nnodes $TOTAL_NODES \
  --node-rank $RANK"
```

components/backends/sglang/docs/dsr1-wideep-gb200.md (1)
172-172: Provide explicit examples for other decode nodes. The instruction to "change `--node-rank` to 1, 2, 3..." could be clearer with explicit command examples or a script snippet. Consider adding a shell loop example or explicit commands for at least the first few nodes to make it clearer for users:

```bash
# For node 1:
# ... (same command with --node-rank 1)
# For node 2:
# ... (same command with --node-rank 2)
# etc.
```

container/Dockerfile.sglang-wideep (1)
36-66: Consider extracting the UCX/NIXL build into a separate script. The conditional build block for UCX and NIXL is quite large and complex. Consider moving this logic to a separate shell script for better maintainability. This would make the Dockerfile cleaner and the build logic easier to test and modify independently.
components/backends/sglang/slurm_jobs/scripts/worker_setup.py (1)

246-276: Add error checking for background process startup. Multiple critical processes are started in the background without verifying they're running successfully. While the learnings indicate this is intentional for debugging, consider adding initial health checks. After starting each background process, consider adding a brief sleep and checking if the process is still running:

```python
time.sleep(2)  # Give process time to start
if nats_process.poll() is not None:
    logging.error(f"nats-server exited with code: {nats_process.returncode}")
    # Continue anyway per the design decision
```
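A self-contained sketch of such a health check (the helper name and the demo processes are hypothetical; worker_setup.py's actual process handles may differ):

```python
import logging
import subprocess
import sys
import time


def started_ok(proc: subprocess.Popen, name: str, grace: float = 2.0) -> bool:
    """Return True if `proc` is still alive `grace` seconds after launch.

    Logs an error on early exit but never kills anything, matching the
    design decision to leave background processes running for debugging.
    """
    time.sleep(grace)
    if proc.poll() is not None:
        logging.error("%s exited early with code %s", name, proc.returncode)
        return False
    return True


# Usage sketch: a process that exits immediately vs. one that keeps running.
dead = subprocess.Popen([sys.executable, "-c", "pass"])
alive = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(10)"])
print(started_ok(dead, "dead-proc"), started_ok(alive, "alive-proc"))
alive.kill()
```

This keeps the "continue anyway" semantics: a failed check only produces a log line, so operators can still attach to the node and debug.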
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)

- components/backends/sglang/docs/dsr1-wideep-gb200.md (1 hunks)
- components/backends/sglang/docs/dsr1-wideep-h100.md (6 hunks)
- components/backends/sglang/slurm_jobs/README.md (3 hunks)
- components/backends/sglang/slurm_jobs/job_script_template.j2 (4 hunks)
- components/backends/sglang/slurm_jobs/scripts/gb200.sh (1 hunks)
- components/backends/sglang/slurm_jobs/scripts/h100.sh (1 hunks)
- components/backends/sglang/slurm_jobs/scripts/worker_setup.py (7 hunks)
- components/backends/sglang/slurm_jobs/submit_job_script.py (3 hunks)
- container/Dockerfile.sglang-wideep (1 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:113-116
Timestamp: 2025-07-03T09:44:41.470Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, logging the full configuration file content is acceptable because the config file is public, contains only placeholder replacements (no sensitive data), and provides debugging benefits for users who may want to create configurations based on the logged output.
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
📚 Learning: in examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd)...
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Applied to files:
components/backends/sglang/docs/dsr1-wideep-h100.md
components/backends/sglang/slurm_jobs/scripts/h100.sh
components/backends/sglang/slurm_jobs/README.md
components/backends/sglang/slurm_jobs/scripts/gb200.sh
components/backends/sglang/slurm_jobs/job_script_template.j2
container/Dockerfile.sglang-wideep
components/backends/sglang/slurm_jobs/scripts/worker_setup.py
📚 Learning: in the slurm job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_...
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Applied to files:
components/backends/sglang/docs/dsr1-wideep-h100.md
components/backends/sglang/slurm_jobs/scripts/h100.sh
components/backends/sglang/slurm_jobs/README.md
components/backends/sglang/slurm_jobs/submit_job_script.py
components/backends/sglang/slurm_jobs/scripts/gb200.sh
components/backends/sglang/docs/dsr1-wideep-gb200.md
components/backends/sglang/slurm_jobs/job_script_template.j2
components/backends/sglang/slurm_jobs/scripts/worker_setup.py
📚 Learning: the sglang `async_encode` method does not support streaming options, so collecting all embeddings be...
Learnt from: t-ob
PR: ai-dynamo/dynamo#1290
File: launch/dynamo-run/src/subprocess/sglang_inc.py:80-110
Timestamp: 2025-06-03T10:17:51.711Z
Learning: The sglang `async_encode` method does not support streaming options, so collecting all embeddings before yielding is the correct approach for embedding requests.
Applied to files:
components/backends/sglang/docs/dsr1-wideep-h100.md
📚 Learning: in fault tolerance test configurations, the `resources` section under `serviceargs` specifies resour...
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the `resources` section under `ServiceArgs` specifies resources per individual worker, not total resources for all workers. So `workers: 8` with `gpu: '1'` means 8 workers × 1 GPU each = 8 GPUs total.
Applied to files:
components/backends/sglang/docs/dsr1-wideep-h100.md
components/backends/sglang/slurm_jobs/README.md
components/backends/sglang/slurm_jobs/scripts/worker_setup.py
📚 Learning: in examples/sglang/slurm_jobs/scripts/worker_setup.py, logging the full configuration file content i...
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:113-116
Timestamp: 2025-07-03T09:44:41.470Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, logging the full configuration file content is acceptable because the config file is public, contains only placeholder replacements (no sensitive data), and provides debugging benefits for users who may want to create configurations based on the logged output.
Applied to files:
components/backends/sglang/slurm_jobs/scripts/h100.sh
components/backends/sglang/slurm_jobs/README.md
components/backends/sglang/slurm_jobs/job_script_template.j2
components/backends/sglang/slurm_jobs/scripts/worker_setup.py
📚 Learning: in multi-node setups with head/worker architecture, the head node typically doesn't need environment...
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.
Applied to files:
components/backends/sglang/slurm_jobs/job_script_template.j2
components/backends/sglang/slurm_jobs/scripts/worker_setup.py
📚 Learning: in dockerfiles, when appending to environment variables that may not exist in the base image, docker...
Learnt from: grahamking
PR: ai-dynamo/dynamo#1177
File: container/Dockerfile.vllm:102-105
Timestamp: 2025-05-28T22:54:46.875Z
Learning: In Dockerfiles, when appending to environment variables that may not exist in the base image, Docker validation will fail if you reference undefined variables with ${VARIABLE} syntax. In such cases, setting the environment variable directly (e.g., ENV CPATH=/usr/include) rather than appending is the appropriate approach.
Applied to files:
container/Dockerfile.sglang-wideep
📚 Learning: in components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally d...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.
Applied to files:
container/Dockerfile.sglang-wideep
🧬 Code Graph Analysis (2)
components/backends/sglang/slurm_jobs/scripts/h100.sh (1)
components/backends/sglang/slurm_jobs/scripts/gb200.sh (1)
print_usage
(6-15)
components/backends/sglang/slurm_jobs/scripts/gb200.sh (1)
components/backends/sglang/slurm_jobs/scripts/h100.sh (1)
print_usage
(6-15)
🪛 LanguageTool
components/backends/sglang/docs/dsr1-wideep-h100.md
[style] ~171-~171: Consider a more concise word here.
Context: ...ill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distribu...
(IN_ORDER_TO_PREMIUM)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2223/merge) by ishandhanani.
components/backends/sglang/slurm_jobs/scripts/h100.sh
[error] 1-1: Script has a shebang but is not marked executable. Run 'chmod +x components/backends/sglang/slurm_jobs/scripts/h100.sh' to fix.
components/backends/sglang/slurm_jobs/scripts/gb200.sh
[error] 1-1: Script has a shebang but is not marked executable. Run 'chmod +x components/backends/sglang/slurm_jobs/scripts/gb200.sh' to fix.
container/Dockerfile.sglang-wideep
[error] 15-24: Trailing whitespace detected and fixed by pre-commit hook 'trailing-whitespace'. Please ensure no trailing whitespace remains.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: Build and Test - vllm
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (20)
components/backends/sglang/docs/dsr1-wideep-h100.md (2)

60-60: LGTM on the updated ingress command. The change from `dynamo run in=http out=dyn &` to `python3 -m dynamo.frontend --http-port=8000 &` aligns with the modular approach introduced in this PR, where GPU-specific shell scripts now handle command execution.

96-96: Good fix on the step numbering. Correcting the step number from 7 to 6 maintains proper sequential numbering in the instructions.
components/backends/sglang/slurm_jobs/README.md (3)

95-96: LGTM on the new optional arguments documentation. The documentation clearly describes the new `--gpu-type` and `--use-sglang-commands` flags, which align with the broader refactoring to support multiple GPU architectures and command modes.

100-122: Excellent examples demonstrating the new functionality. The examples effectively show how to use the new GPU type selection and command mode options, providing clear guidance for both H100 and GB200 deployments.

136-144: Good enhancement to log monitoring instructions. The addition of commands to tail logs for all prefill or decode workers simultaneously is a valuable operational improvement that aligns with the consolidated job launch approach.
components/backends/sglang/slurm_jobs/submit_job_script.py (3)

89-89: Time limit increase is reasonable for complex deployments. Increasing the default time limit from 1 hour to 4 hours makes sense for large-scale distributed deployments that may require more setup and initialization time.

103-111: Well-designed new command-line arguments. The new arguments provide clear choices and sensible defaults:

- `--gpu-type` with explicit choices and H100 as default
- `--use-sglang-commands` as a boolean flag with clear intent

132-133: Proper integration of new arguments into template variables. The new arguments are correctly added to the template variables dictionary, ensuring they're available for use in the Jinja2 template.
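A minimal sketch of how flags like these feed the template variables (flag names follow the review's description; the actual wiring in submit_job_script.py may differ):

```python
import argparse

# Sketch of the new CLI flags described above; the real submit_job_script.py
# may define additional arguments, help text, and defaults.
parser = argparse.ArgumentParser()
parser.add_argument("--gpu-type", choices=["h100", "gb200"], default="h100")
parser.add_argument("--use-sglang-commands", action="store_true")

args = parser.parse_args(["--gpu-type", "gb200", "--use-sglang-commands"])

# Expose the parsed values to the Jinja2 template as plain dict entries.
template_vars = {
    "gpu_type": args.gpu_type,
    "use_sglang_commands": args.use_sglang_commands,
}
print(template_vars)
```

Using `choices=` makes invalid GPU types fail at parse time rather than surfacing later as a missing `h100.sh`/`gb200.sh` script.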
components/backends/sglang/slurm_jobs/scripts/gb200.sh (6)

5-37: LGTM on argument validation logic. The usage function and argument validation are well-implemented, providing clear error messages and maintaining consistency with the h100.sh script structure.

43-67: Thorough environment variable validation. The script properly validates all required environment variables with clear error messages, which will help with debugging deployment issues.
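The validation pattern praised above can be sketched as a small reusable check (the variable names here are illustrative, not necessarily the exact set gb200.sh validates):

```shell
# Sketch of a required-env-var check; the variable list is illustrative,
# not necessarily the exact set gb200.sh validates.
check_required() {
  for var in "$@"; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "Error: required environment variable $var is not set" >&2
      return 1
    fi
  done
  return 0
}

# Demo: both set -> passes; one missing -> fails with a clear message.
HOST_IP=10.0.0.1 PORT=5000
if check_required HOST_IP PORT; then echo "ok"; fi
check_required HOST_IP PORT UNSET_VAR_FOR_DEMO 2>/dev/null || echo "caught missing var"
```

Centralizing the check keeps the per-variable error messages consistent across both GPU scripts.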
77-122: GB200-specific configuration looks appropriate. The prefill dynamo command uses GB200-optimized parameters including specific memory fractions, chunked prefill sizes, and attention backends suitable for the hardware.

129-172: Consistent sglang prefill configuration. The sglang prefill command maintains parameter consistency with the dynamo version while using the appropriate sglang module entry point.

178-222: Decode worker configuration is well-tuned. The decode worker parameters show appropriate tuning for GB200 hardware with higher max-running-requests and adjusted memory fractions.

228-270: Consistent decode sglang configuration. The sglang decode command maintains consistency with the dynamo version while properly excluding the --served-model-name parameter that's not needed for decode workers.
components/backends/sglang/slurm_jobs/scripts/h100.sh (5)

5-67: LGTM on validation logic consistency. The usage function, argument validation, and environment variable validation are identical to gb200.sh, providing consistent user experience across GPU types.

72-100: H100-specific prefill configuration looks appropriate. The H100 dynamo prefill command uses parameters optimized for H100 hardware, including the deepep-config file reference and appropriate memory fraction settings.

103-129: Consistent sglang prefill implementation. The sglang prefill command properly mirrors the dynamo configuration while using the sglang.launch_server entry point and adding the required --host parameter.

134-159: Well-configured H100 decode worker. The decode worker parameters are appropriate for H100 hardware with suitable memory fractions and CUDA graph batch size settings.

162-185: Consistent decode sglang configuration. The sglang decode command maintains consistency with the dynamo version and properly excludes --served-model-name for decode workers.
components/backends/sglang/slurm_jobs/job_script_template.j2 (1)

42-42: Verify the IP address retrieval logic works across different network configurations. The new approach using `ip route get` combined with `getent ahosts` is more robust than `ifconfig`, but the command is complex and may fail in certain network configurations. Consider adding error handling or fallback mechanisms if the primary method fails. Also, ensure this works correctly in environments where nodes have multiple network interfaces or complex routing rules.

Also applies to: 49-49
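One way to add the suggested fallbacks is a short chain that tries each method in turn (a sketch only; the template's actual command may differ, and the 1.1.1.1 probe address is arbitrary):

```shell
# Sketch of an IP-resolution fallback chain; the exact commands in the
# Jinja2 template may differ, and 1.1.1.1 is just an arbitrary probe address.
get_node_ip() {
  ip=$(ip route get 1.1.1.1 2>/dev/null | awk '{for (i=1;i<NF;i++) if ($i=="src") print $(i+1)}' | head -n1)
  [ -n "$ip" ] || ip=$(getent ahosts "$(hostname)" 2>/dev/null | awk '{print $1; exit}')
  [ -n "$ip" ] || ip=$(hostname -I 2>/dev/null | awk '{print $1}')
  echo "$ip"
}

get_node_ip
```

The routing-table lookup wins when it works (it picks the interface that actually reaches other nodes); the hostname-based fallbacks cover hosts without a default route or with restricted `ip` access.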
Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: ishandhanani <[email protected]>
Test matrix
Summary by CodeRabbit

New Features
- GPU-specific launch scripts (`gb200.sh`, `h100.sh`) for streamlined job launching on GB200 and H100 GPUs, supporting both Dynamo and SGLang commands.

Improvements

Bug Fixes

Chores