feat: automate slurm handling in sglang example. #1730

fsaady · 2025-07-02T10:04:06Z

Overview:

Adds a SLURM job script template and a launch wrapper to the sglang examples for starting Dynamo Serve services on a given number of nodes in a SLURM cluster.

Details:

Added a slurm_jobs folder under examples/sglang that contains SLURM scripts for launching the Dynamo Serve service on SLURM cluster nodes and monitoring GPU activity. The primary purpose is to automate starting prefill and decode nodes so that benchmarks are easy to run.
This automates the process that is described in examples/sglang/dsr1-wideep.md.
For more details: examples/sglang/slurm_jobs/README.md

Where should the reviewer start?

examples/sglang/slurm_jobs/README.md
examples/sglang/slurm_jobs/submit_job_script.py

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to: feat: add dynamo components for sglang #1721. This PR simplifies the process introduced in examples/sglang/dsr1-wideep.md by eliminating the need to connect to cluster nodes and run the worker scripts manually. It sets up the environment for the user, whose only responsibility is to provide the number of prefill and decode nodes requested.

Summary by CodeRabbit

New Features
- Added comprehensive SLURM job management scripts and templates for launching and benchmarking distributed Dynamo Serve workloads on SLURM clusters.
- Introduced automated job submission, worker setup, and GPU utilization monitoring tools.
- Provided detailed documentation and usage instructions for deploying and monitoring jobs, including log organization and real-time resource tracking.
Documentation
- Added a README with instructions and explanations for all scripts and workflows in the SLURM jobs example directory.
Chores
- Added a .gitignore to exclude logs and outputs from version control in the SLURM jobs example.

copy-pr-bot · 2025-07-02T10:04:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-07-02T10:04:14Z

👋 Hi fsaady! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

coderabbitai · 2025-07-02T10:08:23Z

Walkthrough

Several new files were added to the examples/sglang/slurm_jobs directory to enable automated launching, configuration, and monitoring of distributed Dynamo Serve benchmarking jobs on SLURM clusters. These include a .gitignore, a comprehensive README, SLURM job template, job submission script, worker setup script, and a GPU utilization monitoring script.

Changes

| File(s) | Change Summary |
|---|---|
| examples/sglang/slurm_jobs/.gitignore | Added to ignore all files in `logs/` and `outputs/` directories from git tracking. |
| examples/sglang/slurm_jobs/README.md | New README detailing SLURM job orchestration, script usage, logging, and monitoring for Dynamo Serve benchmarking. |
| examples/sglang/slurm_jobs/job_script_template.j2 | Added Jinja2 template for dynamic SLURM job scripts orchestrating distributed prefill and decode nodes. |
| examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh | New Bash script for real-time GPU utilization monitoring, printing changes with timestamps. |
| examples/sglang/slurm_jobs/scripts/worker_setup.py | Python script to configure and launch prefill/decode workers, update configs, monitor etcd, and log GPU usage. |
| examples/sglang/slurm_jobs/submit_job_script.py | Python script to render the SLURM job template, generate job scripts, and submit jobs via `sbatch`. |
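
As a rough illustration of the render-then-submit flow summarized above, here is a minimal sketch. It is not the code from this PR: the template text, the function names, and the assumption that the total node count is the sum of prefill and decode nodes are all hypothetical, and the standard library's `string.Template` stands in for the Jinja2 template so the sketch has no dependencies. The only SLURM behavior relied on is that `sbatch` accepts `--nodes` and `--job-name` and prints `Submitted batch job <id>` on success.

```python
import re
import subprocess
from pathlib import Path
from string import Template

# Hypothetical stand-in template; the PR itself uses a Jinja2 file
# (job_script_template.j2) whose contents are not reproduced here.
JOB_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#SBATCH --nodes=$total_nodes\n"
    "#SBATCH --job-name=$job_name\n"
    "echo prefill=$prefill_nodes decode=$decode_nodes\n"
)


def render_job_script(prefill_nodes: int, decode_nodes: int, job_name: str) -> str:
    """Fill in the template. Assumes total nodes = prefill + decode."""
    return JOB_TEMPLATE.substitute(
        total_nodes=prefill_nodes + decode_nodes,
        prefill_nodes=prefill_nodes,
        decode_nodes=decode_nodes,
        job_name=job_name,
    )


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_stdout!r}")
    return match.group(1)


def submit(script_path: Path) -> str:
    """Submit a rendered script and return its job ID (requires SLURM on PATH)."""
    result = subprocess.run(
        ["sbatch", str(script_path)], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

Keeping `parse_job_id` separate from `submit` means the output parsing can be exercised on a machine without SLURM installed.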

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SubmitScript as submit_job_script.py
    participant Slurm
    participant JobScript as job_script_template.j2
    participant Worker as worker_setup.py
    participant Monitor as monitor_gpu_utilization.sh

    User->>SubmitScript: Run submit_job_script.py with parameters
    SubmitScript->>JobScript: Render job_script_template.j2 with variables
    SubmitScript->>Slurm: Submit job via sbatch
    Slurm->>JobScript: Schedule job on allocated nodes
    JobScript->>Worker: Launch worker_setup.py (prefill/decode) on nodes
    Worker->>Monitor: Start monitor_gpu_utilization.sh (background)
    Worker->>Worker: Update YAML config, check etcd, launch services
    Note right of Worker: Prefill node 0 also starts etcd and nats-server
    Worker-->>JobScript: Workers run Dynamo services

Poem

In the warren of SLURM, the jobs now hop,
Prefill and decode, each takes its stop.
Scripts set the stage, logs kept neat,
GPU stats thump with a rhythmic beat.
With templates and YAML, the clusters align—
Oh, what a burrowed, benchmarked design!
🐰✨

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 12

♻️ Duplicate comments (1)

examples/sglang/slurm_jobs/scripts/worker_setup.py (1)

134-135: Previous error handling suggestion is unnecessary.

The wait_for_etcd function already properly handles RequestException, so the suggested fix for check_etcd_health is not needed.

🧹 Nitpick comments (11)

examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh (1)
1-19: Add error handling for nvidia-smi availability.

The monitoring script logic is well-implemented with change detection to reduce log verbosity. However, consider adding error handling to check if nvidia-smi is available before starting the monitoring loop.
+# Check if nvidia-smi is available
+if ! command -v nvidia-smi &> /dev/null; then
+    echo "Error: nvidia-smi not found. This script requires NVIDIA GPU support."
+    exit 1
+fi
+
 echo "Starting GPU utilization monitoring (checking every ${INTERVAL}s, printing only on changes)..."
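
For comparison, the same two ideas (checking that the tool exists before monitoring, and reporting only when a reading changes) can be sketched in Python. This is an illustrative sketch rather than a translation of monitor_gpu_utilization.sh; the function names and the sample readings are hypothetical.

```python
import shutil
from typing import Iterable, Iterator, Optional


def tool_available(name: str = "nvidia-smi") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(name) is not None


def only_changes(readings: Iterable[str]) -> Iterator[str]:
    """Yield a reading only when it differs from the previous one."""
    previous: Optional[str] = None
    for reading in readings:
        if reading != previous:
            yield reading
            previous = reading


if __name__ == "__main__":
    if not tool_available():
        raise SystemExit("Error: nvidia-smi not found. GPU monitoring needs NVIDIA tooling.")
    # Hypothetical sample readings; a real monitor would poll nvidia-smi instead.
    for line in only_changes(["0,0", "0,0", "35,0", "35,0", "0,0"]):
        print(line)
```

Failing fast when the tool is missing avoids a monitoring loop that silently logs nothing for the lifetime of the job.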
examples/sglang/slurm_jobs/README.md (1)
22-22: Specify language for fenced code block.

Add language specification to improve documentation clarity and comply with markdown best practices.
-```
+```text
 logs/
examples/sglang/slurm_jobs/submit_job_script.py (1)
5-6: Remove unused imports.

The sys and os imports are not used and should be removed to keep the code clean.
-import sys
-import os
 import tempfile
examples/sglang/slurm_jobs/job_script_template.j2 (1)
22-84: Consider removing unnecessary Jinja2 raw block.

The {% raw %} block seems unnecessary since the contained code doesn't appear to conflict with Jinja2 syntax.

Unless there are specific Jinja2 syntax conflicts, the raw block can be removed to improve template readability:
-{% raw %}
-
 # Initial setup
 mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"
 ...
 echo "Script finished at $(date)"
-
-{% endraw %}
examples/sglang/slurm_jobs/scripts/worker_setup.py (7)
52-59: Consider using a dataclass to reduce parameter count.

The function has 6 parameters, which makes it harder to maintain and call correctly.

Consider grouping related parameters:
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ClusterConfig:
    """Cluster topology values passed to update_yaml_config as one object."""
    prefill_host_ip: str
    decode_host_ip: str
    rank: int
    total_nodes: int
    total_gpus: int


def update_yaml_config(config_file: Path, cluster: ClusterConfig) -> None:
    ...  # body unchanged apart from reading values from `cluster`
125-125: Consider reducing the default timeout period.

The default of 1000 retries with 2-second intervals results in a ~33-minute timeout, which seems excessive for service startup.
-def wait_for_etcd(etcd_url: str, max_retries: int = 1000) -> bool:
+def wait_for_etcd(etcd_url: str, max_retries: int = 60) -> bool:  # 2 minutes default
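
The timeout arithmetic behind this suggestion can be checked with a small sketch. It assumes, as the comment above implies, one fixed 2-second sleep after each failed attempt; `wait_until_healthy` and `worst_case_wait_s` are hypothetical names and not the `wait_for_etcd` implementation from this PR. The probe and the sleep function are injected so the loop can be tested without a running etcd.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    max_retries: int = 60,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to max_retries times, sleeping after each failed attempt."""
    for _ in range(max_retries):
        if probe():
            return True
        sleep(interval_s)
    return False


def worst_case_wait_s(max_retries: int, interval_s: float = 2.0) -> float:
    """Total time spent sleeping if every attempt fails."""
    return max_retries * interval_s


if __name__ == "__main__":
    print(worst_case_wait_s(1000) / 60)  # about 33.3 minutes with the original default
    print(worst_case_wait_s(60) / 60)    # 2.0 minutes with the suggested default
```

This bound ignores the time each probe itself takes, so a slow or hanging health check would extend the real wait.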
161-167: Remove unnecessary else block after return.

The else block can be de-indented since the if block returns.
     if background:
         process = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE, stderr=subprocess.PIPE)  # noqa: S603
         return process
-    else:
-        result = subprocess.run(cmd, shell=shell, check=True)  # noqa: S603
-        return result.returncode
+    result = subprocess.run(cmd, shell=shell, check=True)  # noqa: S603
+    return result.returncode
146-167: Consider safer command execution patterns.

While shell injection risk is acknowledged with # noqa, consider using list-based commands where possible to avoid shell=True.

For commands that don't require shell features, pass them as lists:
# Instead of: run_command("etcd --arg1 value1", shell=True)
# Use: subprocess.run(["etcd", "--arg1", "value1"], shell=False)
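
A short sketch of the list-based pattern, assuming the command needs no shell features such as pipes or redirection. The helper names are hypothetical; `shlex.split` and `subprocess.run` are standard-library calls.

```python
import shlex
import subprocess
import sys


def to_argv(command: str) -> list[str]:
    """Split a command string into an argument list using POSIX shell quoting rules."""
    return shlex.split(command)


def run_without_shell(argv: list[str]) -> int:
    """Run argv directly, with no shell in between, and return the exit code."""
    completed = subprocess.run(argv, shell=False, check=True)
    return completed.returncode


if __name__ == "__main__":
    print(to_argv("etcd --arg1 value1"))
    # Use the current interpreter as a harmless, always-present command.
    print(run_without_shell([sys.executable, "-c", "pass"]))
```

Because each list element is passed to the program as a single argument, values containing spaces or shell metacharacters are not reinterpreted by a shell.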
208-219: Add more comprehensive argument validation.

Consider validating IP addresses and cross-field constraints.
 def _validate_args(args: argparse.Namespace) -> None:
     """Validate command line arguments"""
     if args.rank < 0:
         raise ValueError("Rank must be non-negative")
+    
+    if args.rank >= args.total_nodes:
+        raise ValueError(f"Rank {args.rank} must be less than total nodes {args.total_nodes}")
 
     if args.total_nodes < 1:
         raise ValueError("Total nodes must be at least 1")
 
     if args.gpus_per_node < 1:
         raise ValueError("GPUs per node must be at least 1")
+    
+    # Validate IP addresses
+    import ipaddress
+    try:
+        ipaddress.ip_address(args.prefill_host_ip)
+        ipaddress.ip_address(args.decode_host_ip)
+    except ValueError as e:
+        raise ValueError(f"Invalid IP address: {e}") from e
275-275: Add return type annotation.
-def setup_env(prefill_host_ip: str):
+def setup_env(prefill_host_ip: str) -> None:
286-286: Add return type annotation.
-def main(args: list[str] | None = None):
+def main(args: list[str] | None = None) -> None:

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee86bad and f06f481.

📒 Files selected for processing (6)

examples/sglang/slurm_jobs/.gitignore (1 hunks)
examples/sglang/slurm_jobs/README.md (1 hunks)
examples/sglang/slurm_jobs/job_script_template.j2 (1 hunks)
examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh (1 hunks)
examples/sglang/slurm_jobs/scripts/worker_setup.py (1 hunks)
examples/sglang/slurm_jobs/submit_job_script.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

examples/sglang/slurm_jobs/scripts/worker_setup.py (1)

Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.916Z
Learning: The `@dynamo_worker()` decorator in the dynamo codebase returns a wrapper that automatically injects the `runtime` parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature `async def get_metrics(runtime, log_dir)` decorated with `@dynamo_worker()` can be called as `get_metrics(log_dir)` because the decorator wrapper injects the runtime parameter.
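
The injection pattern described in this learning can be sketched generically. The decorator below is a hypothetical stand-in written for illustration only. It is not the real `@dynamo_worker()` implementation, and `FakeRuntime` and `inject_runtime` are invented names.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


class FakeRuntime:
    """Hypothetical stand-in for the runtime object a real decorator would supply."""

    name = "fake-runtime"


def inject_runtime() -> Callable[[Callable[..., Awaitable[Any]]], Callable[..., Awaitable[Any]]]:
    """Hypothetical decorator factory that prepends a runtime argument on every call."""

    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            runtime = FakeRuntime()
            # The caller's arguments follow the injected runtime.
            return await func(runtime, *args, **kwargs)

        return wrapper

    return decorator


@inject_runtime()
async def get_metrics(runtime: FakeRuntime, log_dir: str) -> str:
    return f"{runtime.name}:{log_dir}"


if __name__ == "__main__":
    # Callers pass only log_dir; the wrapper supplies runtime.
    print(asyncio.run(get_metrics("logs")))
```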

🧬 Code Graph Analysis (1)

examples/sglang/slurm_jobs/submit_job_script.py (1)

examples/sglang/slurm_jobs/scripts/worker_setup.py (2)

_parse_command_line_args (169-205)

main (286-315)

🪛 LanguageTool

examples/sglang/slurm_jobs/README.md

[uncategorized] ~11-~11: Loose punctuation mark.
Context: ...## Scripts - submit_job_script.py: Main script for generating and submitti...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 markdownlint-cli2 (0.17.2)

examples/sglang/slurm_jobs/README.md

22-22: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.11.9)

examples/sglang/slurm_jobs/submit_job_script.py

5-5: sys imported but unused

Remove unused import: sys

(F401)

6-6: os imported but unused

Remove unused import: os

(F401)

🪛 Flake8 (7.2.0)

examples/sglang/slurm_jobs/submit_job_script.py

[error] 5-5: 'sys' imported but unused

(F401)

[error] 6-6: 'os' imported but unused

(F401)

🪛 Pylint (3.3.7)

examples/sglang/slurm_jobs/scripts/worker_setup.py

[refactor] 52-52: Too many arguments (6/5)

(R0913)

[refactor] 52-52: Too many positional arguments (6/5)

(R0917)

[refactor] 52-52: Too many local variables (16/15)

(R0914)

[refactor] 161-166: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

[refactor] 162-162: Consider using 'with' for resource-allocating operations

(R1732)

🔇 Additional comments (4)

examples/sglang/slurm_jobs/.gitignore (1)

1-2: LGTM! Appropriate exclusions for runtime-generated directories.

The .gitignore correctly excludes logs and outputs directories that are created during SLURM job execution, preventing transient files from being tracked in the repository.

examples/sglang/slurm_jobs/scripts/worker_setup.py (3)

33-37: Logging setup looks good.

The function properly configures logging with timestamps and appropriate formatting.

169-206: Command-line argument parsing is well-structured.

Good use of required arguments, choices, and sensible defaults.

259-273: Decode node setup follows good patterns.

The function properly waits for etcd before starting the decode worker.

examples/sglang/slurm_jobs/submit_job_script.py

examples/sglang/slurm_jobs/job_script_template.j2

examples/sglang/slurm_jobs/scripts/worker_setup.py

coderabbitai · 2025-07-02T10:11:30Z

Walkthrough

A new SLURM job orchestration suite was added under examples/sglang/slurm_jobs, including scripts, templates, and documentation for launching and benchmarking distributed Dynamo Serve jobs on SLURM clusters. The additions cover job submission, worker setup, GPU monitoring, YAML config management, and detailed usage instructions.

Changes

| File(s) | Change Summary |
|---|---|
| examples/sglang/slurm_jobs/.gitignore | Added to ignore `logs/` and `outputs/` directories in Git. |
| examples/sglang/slurm_jobs/README.md | New README documenting SLURM job scripts, usage, and log/output structure. |
| examples/sglang/slurm_jobs/job_script_template.j2 | Added Jinja2 SLURM batch script template for distributed prefill/decode node orchestration. |
| examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh | New Bash script for periodic GPU utilization monitoring and logging. |
| examples/sglang/slurm_jobs/scripts/worker_setup.py | New Python script for configuring, launching, and managing distributed worker nodes with YAML config updates. |
| examples/sglang/slurm_jobs/submit_job_script.py | New Python script to generate and submit SLURM jobs from templates, handling argument parsing and job tracking. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant submit_job_script.py
    participant SLURM
    participant job_script_template.j2
    participant worker_setup.py
    participant monitor_gpu_utilization.sh

    User->>submit_job_script.py: Provide job parameters & paths
    submit_job_script.py->>job_script_template.j2: Render SLURM job script
    submit_job_script.py->>SLURM: Submit job via sbatch
    SLURM->>worker_setup.py: Launch worker nodes (prefill/decode)
    worker_setup.py->>monitor_gpu_utilization.sh: Start GPU monitoring (background)
    worker_setup.py->>worker_setup.py: Update YAML config, start services
    worker_setup.py->>SLURM: Run distributed Dynamo workloads

Poem

In the warren of clusters, the scripts now hop,
Prefill and decode, each takes its stop.
Logs and outputs, hidden from view,
GPU stats sparkle, fresh as dew.
With templates and YAML, the jobs are in sync—
A bunny’s delight, more robust than you think!
🐇✨


coderabbitai

Actionable comments posted: 10

🧹 Nitpick comments (4)

examples/sglang/slurm_jobs/README.md (1)
22-42: Specify language for the fenced code block.

The directory structure code block should specify a language for better rendering and accessibility.
-```
+```text
 logs/
 ├── 3062824/                    # Job ID directory
examples/sglang/slurm_jobs/submit_job_script.py (1)
5-6: Remove unused imports.

The sys and os imports are not used in the script and should be removed to keep the code clean.
-import sys
-import os
 import tempfile
examples/sglang/slurm_jobs/scripts/worker_setup.py (2)
52-59: Consider refactoring to reduce the number of parameters.

The function has 6 parameters, which makes it harder to maintain and test. Consider using a dataclass or configuration object to group related parameters.
+from dataclasses import dataclass
+
+@dataclass
+class NodeConfig:
+    prefill_host_ip: str
+    decode_host_ip: str
+    rank: int
+    total_nodes: int
+    total_gpus: int
+
 def update_yaml_config(
     config_file: Path,
-    prefill_host_ip: str,
-    decode_host_ip: str,
-    rank: int,
-    total_nodes: int,
-    total_gpus: int,
+    node_config: NodeConfig,
 ) -> None:
161-166: Remove unnecessary else block after return.

The else block is not needed since the if block returns.
     if background:
         process = subprocess.Popen(cmd, shell=shell, stdout=subprocess.PIPE, stderr=subprocess.PIPE)  # noqa: S603
         return process
-    else:
-        result = subprocess.run(cmd, shell=shell, check=True)  # noqa: S603
-        return result.returncode
+    
+    result = subprocess.run(cmd, shell=shell, check=True)  # noqa: S603
+    return result.returncode

📜 Review details


🧬 Code Graph Analysis (2)

examples/sglang/slurm_jobs/submit_job_script.py (1)

examples/sglang/slurm_jobs/scripts/worker_setup.py (2)

_parse_command_line_args (169-205)

main (286-315)

examples/sglang/slurm_jobs/scripts/worker_setup.py (1)

examples/sglang/slurm_jobs/submit_job_script.py (2)

_parse_command_line_args (34-48)

main (51-71)


🔇 Additional comments (5)

examples/sglang/slurm_jobs/.gitignore (1)

1-2: LGTM!

The gitignore configuration correctly excludes runtime artifacts (logs and outputs) that will be generated by the SLURM job system, keeping the repository clean.

examples/sglang/slurm_jobs/README.md (1)

1-88: Excellent documentation quality.

The README provides comprehensive coverage of the SLURM job system, including clear usage instructions, detailed log structure explanation, and monitoring guidance. This will greatly help users understand and utilize the automation tools.

examples/sglang/slurm_jobs/scripts/worker_setup.py (3)

1-31: Well-structured module setup with clear documentation.

The imports are organized properly and the network configuration constants follow naming conventions. The module docstring clearly explains the script's purpose.

33-50: Logging setup and GPU monitoring functions look good.

The functions are well-implemented with appropriate error handling for the GPU monitoring subprocess.

169-219: Well-structured command line argument handling.

The argument parsing is comprehensive with appropriate validation and clear help messages.

examples/sglang/slurm_jobs/scripts/monitor_gpu_utilization.sh

examples/sglang/slurm_jobs/submit_job_script.py

examples/sglang/slurm_jobs/job_script_template.j2

examples/sglang/slurm_jobs/scripts/worker_setup.py

Signed-off-by: Fadi Saady <[email protected]>

examples/sglang/slurm_jobs/README.md

examples/sglang/slurm_jobs/scripts/worker_setup.py

examples/sglang/slurm_jobs/README.md

ishandhanani

Overall very very solid job here. Added a couple comments that should be addressed. Most of them revolve around the instructions/readme which should be similar if not the same to https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm/configs/deepseek_r1/multinode (just not including GB200).

Additionally - can you link this document here to the dsr1-wideep.md document so users can find it as well?

examples/sglang/slurm_jobs/README.md

Signed-off-by: Fadi Saady <[email protected]>

examples/sglang/slurm_jobs/submit_job_script.py

examples/sglang/slurm_jobs/README.md

ishandhanani

LGTM!

Left some comments for the future. If you want to address the local path in readme that would be great. Otherwise 🚢 !

Signed-off-by: Fadi Saady <[email protected]>

fsaady requested review from biswapanda, ishandhanani, tedzhouhk, rmccorm4, alec-flowers and grahamking as code owners July 2, 2025 10:04

pull-request-size bot added the size/XL label Jul 2, 2025

github-actions bot added the external-contribution and feat labels Jul 2, 2025

coderabbitai bot reviewed Jul 2, 2025


fsaady force-pushed the main branch 2 times, most recently from 0f875bc to ac35a0f Compare July 3, 2025 10:56

feat: automate slurm handling in sglang example.

51baf61

Signed-off-by: Fadi Saady <[email protected]>

fsaady force-pushed the main branch from ac35a0f to 51baf61 Compare July 3, 2025 11:10

fsaady added 3 commits July 3, 2025 13:12

Merge branch 'ai-dynamo:main' into main

fab308c

chore: adding generic node names in README.

34002f9

Signed-off-by: Fadi Saady <[email protected]>

fix: aligning sglang automation with PR 1721.

6ecf4f4

Signed-off-by: Fadi Saady <[email protected]>

ishandhanani reviewed Jul 3, 2025

examples/sglang/slurm_jobs/README.md (resolved)

ishandhanani reviewed Jul 3, 2025

examples/sglang/slurm_jobs/scripts/worker_setup.py (resolved)

ishandhanani reviewed Jul 3, 2025

examples/sglang/slurm_jobs/README.md (outdated, resolved)

ishandhanani reviewed Jul 3, 2025

examples/sglang/slurm_jobs/README.md (outdated, resolved)

ishandhanani requested changes Jul 3, 2025

examples/sglang/slurm_jobs/README.md (outdated, resolved)

examples/sglang/slurm_jobs/README.md (outdated, resolved)

fsaady force-pushed the main branch from e53ab23 to 6c31477 Compare July 4, 2025 10:52

fix: README modifications.

2c0f894

Signed-off-by: Fadi Saady <[email protected]>

fsaady force-pushed the main branch from 6c31477 to 2c0f894 Compare July 4, 2025 10:54

ishandhanani reviewed Jul 4, 2025

examples/sglang/slurm_jobs/submit_job_script.py (resolved)

ishandhanani reviewed Jul 4, 2025

examples/sglang/slurm_jobs/submit_job_script.py (resolved)

ishandhanani reviewed Jul 4, 2025

examples/sglang/slurm_jobs/README.md (outdated, resolved)

ishandhanani reviewed Jul 4, 2025

examples/sglang/slurm_jobs/README.md (outdated, resolved)

ishandhanani approved these changes Jul 4, 2025

fsaady added 3 commits July 4, 2025 15:27

fix: minor fixes in readme.

c143f82

Signed-off-by: Fadi Saady <[email protected]>

chore: adding copy rights.

a6497a3

Signed-off-by: Fadi Saady <[email protected]>

chore: fix mypy issues.

e109107

Signed-off-by: Fadi Saady <[email protected]>

ishandhanani merged commit 1630f8b into ai-dynamo:main Jul 6, 2025
7 checks passed

coderabbitai bot mentioned this pull request Jul 7, 2025

feat: add gb200 and sglang commands to slurm scripts #1804

Closed

atchernych pushed a commit that referenced this pull request Jul 9, 2025

feat: automate slurm handling in sglang example. (#1730)

be39f08

Signed-off-by: Fadi Saady <[email protected]>
