-
Notifications
You must be signed in to change notification settings - Fork 614
feat: automate slurm handling in sglang example. #1730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Hi fsaady! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughSeveral new files were added to the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant SubmitScript as submit_job_script.py
participant Slurm
participant JobScript as job_script_template.j2
participant Worker as worker_setup.py
participant Monitor as monitor_gpu_utilization.sh
User->>SubmitScript: Run submit_job_script.py with parameters
SubmitScript->>JobScript: Render job_script_template.j2 with variables
SubmitScript->>Slurm: Submit job via sbatch
Slurm->>JobScript: Schedule job on allocated nodes
JobScript->>Worker: Launch worker_setup.py (prefill/decode) on nodes
Worker->>Monitor: Start monitor_gpu_utilization.sh (background)
Worker->>Worker: Update YAML config, check etcd, launch services
Note right of Worker: Prefill node 0 also starts etcd and nats-server
Worker-->>JobScript: Workers run Dynamo services
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 12
♻️ Duplicate comments (1)
examples/sglang/slurm_jobs/scripts/worker_setup.py (1)
134-135
: Previous error handling suggestion is unnecessary.The
wait_for_etcd
function already properly handlesRequestException
, so the suggested fix forcheck_etcd_health
is not needed.
🧹 Nitpick comments (11)
examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh (1)
1-19
: Add error handling for nvidia-smi availability.The monitoring script logic is well-implemented with change detection to reduce log verbosity. However, consider adding error handling to check if nvidia-smi is available before starting the monitoring loop.
+# Check if nvidia-smi is available +if ! command -v nvidia-smi &> /dev/null; then + echo "Error: nvidia-smi not found. This script requires NVIDIA GPU support." + exit 1 +fi + echo "Starting GPU utilization monitoring (checking every ${INTERVAL}s, printing only on changes)..."examples/sglang/slurm_jobs/README.md (1)
22-22
: Specify language for fenced code block.Add language specification to improve documentation clarity and comply with markdown best practices.
-``` +```text logs/examples/sglang/slurm_jobs/submit_job_script.py (1)
5-6
: Remove unused imports.The
sys
andos
imports are not used and should be removed to keep the code clean.-import sys -import os import tempfileexamples/sglang/slurm_jobs/job_script_template.j2 (1)
22-84
: Consider removing unnecessary Jinja2 raw block.The
{% raw %}
block seems unnecessary since the contained code doesn't appear to conflict with Jinja2 syntax.Unless there are specific Jinja2 syntax conflicts, the raw block can be removed to improve template readability:
-{% raw %} - # Initial setup mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}" ... echo "Script finished at $(date)" - -{% endraw %}examples/sglang/slurm_jobs/scripts/worker_setup.py (7)
52-59
: Consider using a dataclass to reduce parameter count.The function has 6 parameters which makes it harder to maintain and call correctly.
Consider grouping related parameters:
from dataclasses import dataclass @dataclass class ClusterConfig: prefill_host_ip: str decode_host_ip: str rank: int total_nodes: int total_gpus: int def update_yaml_config(config_file: Path, cluster: ClusterConfig) -> None:
125-125
: Consider reducing the default timeout period.The default of 1000 retries with 2-second intervals results in a ~33-minute timeout, which seems excessive for service startup.
-def wait_for_etcd(etcd_url: str, max_retries: int = 1000) -> bool: +def wait_for_etcd(etcd_url: str, max_retries: int = 60) -> bool: # 2 minutes default
161-167
: Remove unnecessary else block after return.The else block can be de-indented since the if block returns.
if background: process = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE, stderr=subprocess.PIPE) # noqa: S603 return process - else: - result = subprocess.run(cmd, shell=shell, check=True) # noqa: S603 - return result.returncode + result = subprocess.run(cmd, shell=shell, check=True) # noqa: S603 + return result.returncode
146-167
: Consider safer command execution patterns.While shell injection risk is acknowledged with
# noqa
, consider using list-based commands where possible to avoid shell=True.For commands that don't require shell features, pass them as lists:
# Instead of: run_command("etcd --arg1 value1", shell=True) # Use: subprocess.run(["etcd", "--arg1", "value1"], shell=False)
208-219
: Add more comprehensive argument validation.Consider validating IP addresses and cross-field constraints.
def _validate_args(args: argparse.Namespace) -> None: """Validate command line arguments""" if args.rank < 0: raise ValueError("Rank must be non-negative") + + if args.rank >= args.total_nodes: + raise ValueError(f"Rank {args.rank} must be less than total nodes {args.total_nodes}") if args.total_nodes < 1: raise ValueError("Total nodes must be at least 1") if args.gpus_per_node < 1: raise ValueError("GPUs per node must be at least 1") + + # Validate IP addresses + import ipaddress + try: + ipaddress.ip_address(args.prefill_host_ip) + ipaddress.ip_address(args.decode_host_ip) + except ValueError as e: + raise ValueError(f"Invalid IP address: {e}")
275-275
: Add return type annotation.-def setup_env(prefill_host_ip: str): +def setup_env(prefill_host_ip: str) -> None:
286-286
: Add return type annotation.-def main(args: list[str] | None = None): +def main(args: list[str] | None = None) -> None:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
examples/sglang/slurm_jobs/.gitignore
(1 hunks)examples/sglang/slurm_jobs/README.md
(1 hunks)examples/sglang/slurm_jobs/job_script_template.j2
(1 hunks)examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh
(1 hunks)examples/sglang/slurm_jobs/scripts/worker_setup.py
(1 hunks)examples/sglang/slurm_jobs/submit_job_script.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
examples/sglang/slurm_jobs/scripts/worker_setup.py (1)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.916Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
🧬 Code Graph Analysis (1)
examples/sglang/slurm_jobs/submit_job_script.py (1)
examples/sglang/slurm_jobs/scripts/worker_setup.py (2)
_parse_command_line_args
(169-205)main
(286-315)
🪛 LanguageTool
examples/sglang/slurm_jobs/README.md
[uncategorized] ~11-~11: Loose punctuation mark.
Context: ...## Scripts - submit_job_script.py
: Main script for generating and submitti...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 markdownlint-cli2 (0.17.2)
examples/sglang/slurm_jobs/README.md
22-22: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.11.9)
examples/sglang/slurm_jobs/submit_job_script.py
5-5: sys
imported but unused
Remove unused import: sys
(F401)
6-6: os
imported but unused
Remove unused import: os
(F401)
🪛 Flake8 (7.2.0)
examples/sglang/slurm_jobs/submit_job_script.py
[error] 5-5: 'sys' imported but unused
(F401)
[error] 6-6: 'os' imported but unused
(F401)
🪛 Pylint (3.3.7)
examples/sglang/slurm_jobs/scripts/worker_setup.py
[refactor] 52-52: Too many arguments (6/5)
(R0913)
[refactor] 52-52: Too many positional arguments (6/5)
(R0917)
[refactor] 52-52: Too many local variables (16/15)
(R0914)
[refactor] 161-166: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 162-162: Consider using 'with' for resource-allocating operations
(R1732)
🔇 Additional comments (4)
examples/sglang/slurm_jobs/.gitignore (1)
1-2
: LGTM! Appropriate exclusions for runtime-generated directories.The .gitignore correctly excludes logs and outputs directories that are created during SLURM job execution, preventing transient files from being tracked in the repository.
examples/sglang/slurm_jobs/scripts/worker_setup.py (3)
33-37
: Logging setup looks good.The function properly configures logging with timestamps and appropriate formatting.
169-206
: Command-line argument parsing is well-structured.Good use of required arguments, choices, and sensible defaults.
259-273
: Decode node setup follows good patterns.The function properly waits for etcd before starting the decode worker.
WalkthroughA new SLURM job orchestration suite was added under Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant submit_job_script.py
participant SLURM
participant job_script_template.j2
participant worker_setup.py
participant monitor_gpu_utilization.sh
User->>submit_job_script.py: Provide job parameters & paths
submit_job_script.py->>job_script_template.j2: Render SLURM job script
submit_job_script.py->>SLURM: Submit job via sbatch
SLURM->>worker_setup.py: Launch worker nodes (prefill/decode)
worker_setup.py->>monitor_gpu_utilization.sh: Start GPU monitoring (background)
worker_setup.py->>worker_setup.py: Update YAML config, start services
worker_setup.py->>SLURM: Run distributed Dynamo workloads
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
🧹 Nitpick comments (4)
examples/sglang/slurm_jobs/README.md (1)
22-42
: Specify language for the fenced code block.The directory structure code block should specify a language for better rendering and accessibility.
-``` +```text logs/ ├── 3062824/ # Job ID directoryexamples/sglang/slurm_jobs/submit_job_script.py (1)
5-6
: Remove unused imports.The
sys
andos
imports are not used in the script and should be removed to keep the code clean.-import sys -import os import tempfileexamples/sglang/slurm_jobs/scripts/worker_setup.py (2)
52-59
: Consider refactoring to reduce the number of parameters.The function has 6 parameters, which makes it harder to maintain and test. Consider using a dataclass or configuration object to group related parameters.
+from dataclasses import dataclass + +@dataclass +class NodeConfig: + prefill_host_ip: str + decode_host_ip: str + rank: int + total_nodes: int + total_gpus: int + def update_yaml_config( config_file: Path, - prefill_host_ip: str, - decode_host_ip: str, - rank: int, - total_nodes: int, - total_gpus: int, + node_config: NodeConfig, ) -> None:
161-166
: Remove unnecessary else block after return.The else block is not needed since the if block returns.
if background: process = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE, stderr=subprocess.PIPE) # noqa: S603 return process - else: - result = subprocess.run(cmd, shell=shell, check=True) # noqa: S603 - return result.returncode + + result = subprocess.run(cmd, shell=shell, check=True) # noqa: S603 + return result.returncode
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
examples/sglang/slurm_jobs/.gitignore
(1 hunks)examples/sglang/slurm_jobs/README.md
(1 hunks)examples/sglang/slurm_jobs/job_script_template.j2
(1 hunks)examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh
(1 hunks)examples/sglang/slurm_jobs/scripts/worker_setup.py
(1 hunks)examples/sglang/slurm_jobs/submit_job_script.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
examples/sglang/slurm_jobs/scripts/worker_setup.py (1)
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.916Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
🧬 Code Graph Analysis (2)
examples/sglang/slurm_jobs/submit_job_script.py (1)
examples/sglang/slurm_jobs/scripts/worker_setup.py (2)
_parse_command_line_args
(169-205)main
(286-315)
examples/sglang/slurm_jobs/scripts/worker_setup.py (1)
examples/sglang/slurm_jobs/submit_job_script.py (2)
_parse_command_line_args
(34-48)main
(51-71)
🪛 LanguageTool
examples/sglang/slurm_jobs/README.md
[uncategorized] ~11-~11: Loose punctuation mark.
Context: ...## Scripts - submit_job_script.py
: Main script for generating and submitti...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 markdownlint-cli2 (0.17.2)
examples/sglang/slurm_jobs/README.md
22-22: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.11.9)
examples/sglang/slurm_jobs/submit_job_script.py
5-5: sys
imported but unused
Remove unused import: sys
(F401)
6-6: os
imported but unused
Remove unused import: os
(F401)
🪛 Flake8 (7.2.0)
examples/sglang/slurm_jobs/submit_job_script.py
[error] 5-5: 'sys' imported but unused
(F401)
[error] 6-6: 'os' imported but unused
(F401)
🪛 Pylint (3.3.7)
examples/sglang/slurm_jobs/scripts/worker_setup.py
[refactor] 52-52: Too many arguments (6/5)
(R0913)
[refactor] 52-52: Too many positional arguments (6/5)
(R0917)
[refactor] 52-52: Too many local variables (16/15)
(R0914)
[refactor] 161-166: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 162-162: Consider using 'with' for resource-allocating operations
(R1732)
🔇 Additional comments (5)
examples/sglang/slurm_jobs/.gitignore (1)
1-2
: LGTM!The gitignore configuration correctly excludes runtime artifacts (logs and outputs) that will be generated by the SLURM job system, keeping the repository clean.
examples/sglang/slurm_jobs/README.md (1)
1-88
: Excellent documentation quality.The README provides comprehensive coverage of the SLURM job system, including clear usage instructions, detailed log structure explanation, and monitoring guidance. This will greatly help users understand and utilize the automation tools.
examples/sglang/slurm_jobs/scripts/worker_setup.py (3)
1-31
: Well-structured module setup with clear documentation.The imports are organized properly and the network configuration constants follow naming conventions. The module docstring clearly explains the script's purpose.
33-50
: Logging setup and GPU monitoring functions look good.The functions are well-implemented with appropriate error handling for the GPU monitoring subprocess.
169-219
: Well-structured command line argument handling.The argument parsing is comprehensive with appropriate validation and clear help messages.
0f875bc
to
ac35a0f
Compare
Signed-off-by: Fadi Saady <[email protected]>
Signed-off-by: Fadi Saady <[email protected]>
Signed-off-by: Fadi Saady <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall very very solid job here. Added a couple comments that should be addressed. Most of them revolve around the instructions/readme which should be similar if not the same to https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm/configs/deepseek_r1/multinode (just not including GB200).
Additionally - can you link this document here to the dsr1-wideep.md
document so users can find it as well?
Signed-off-by: Fadi Saady <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Left some comments for the future. If you want to address the local path in readme that would be great. Otherwise 🚢 !
Signed-off-by: Fadi Saady <[email protected]>
Signed-off-by: Fadi Saady <[email protected]>
Signed-off-by: Fadi Saady <[email protected]>
Signed-off-by: Fadi Saady <[email protected]>
Overview:
Adding slurm job script template and launch wrapper in the sglang examples for starting dynamo serve services on a number of given nodes in a slurm cluster.
Details:
Added a
slurm_jobs
folder underexamples/sglang
that contains SLURM to handle launching Dynamo Serve service on SLURM cluster nodes and monitor GPU activity. The primary purpose is to automate the process of starting prefill and decode nodes to enable running benchmarks easily.This automates the process that is described in examples/sglang/dsr1-wideep.md.
For more details: examples/sglang/slurm_jobs/README.md
Where should the reviewer start?
examples/sglang/slurm_jobs/README.md
examples/sglang/slurm_jobs/submit_job_script.py
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
examples/sglang/dsr1-wideep.md
by eliminating the need to connect to cluster nodes and running the worker scripts manually. It sets up the environment for the user, and the user's only responsibility is providing the number of requested prefill and decode nodes.Summary by CodeRabbit
New Features
Documentation
Chores
.gitignore
to exclude logs and outputs from version control in the SLURM jobs example.