
Conversation

Member

@kaiyux kaiyux commented Aug 21, 2025

Summary by CodeRabbit

  • New Features

    • Per-group distributed startup for GEN/CTX with explicit node assignment, per-worker logs, and optional NSYS profiling for GEN workers.
    • Separate worker and server config generators and an automated server startup sequence with health checks.
  • Improvements

    • Dynamic tasks-per-node detection, stricter shell safety, robust polling/timeouts, clearer diagnostics, structured per-group log directories, and SLURM-aware job cancellation.
  • Refactor

    • Replaced monolithic YAML startup with disaggregated per-group orchestration and updated script interfaces.
  • Documentation

    • README revised to reflect new config generators, startup flow, and logging.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Contributor

coderabbitai bot commented Aug 21, 2025

📝 Walkthrough

Walkthrough

Replaces the monolithic SLURM YAML workflow with per-role worker/server config generators and orchestration: adds strict shell modes, per-group node assignment and logs, per-worker NSYS profiling support, new gen_worker_config.py and gen_server_config.py, updated start_server.sh/start_worker.sh signatures, health checks, and aggregated log collection.

Changes

Cohort / File(s) Summary of changes
SLURM orchestration and launch
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
Strict shell options; derive ntasks_per_node from SLURM_NTASKS_PER_NODE; rename container to disaggregated_serving; build logdir under slurm-${SLURM_JOB_ID}; compute per-group node lists; launch per-group srun for GEN/CTX with explicit --nodelist; per-group logs; updated start signatures for server/workers; use scancel for cleanup.
Removed: old YAML generator
examples/disaggregated/slurm/benchmark/gen_yaml.py
Deleted entirely: removes SLURM-aware monolithic YAML generator, node/task parsing, URL generation, MOE/load-balancer handling, CLI, and YAML output.
Worker config generator (new)
examples/disaggregated/slurm/benchmark/gen_worker_config.py
Added: generates ctx_config.yaml and gen_config.yaml in work_dir; CLI exposes per-side options (TP/PP sizes, batch/seq, memory fractions, MOE/load-balancer, MTP/speculative options, cache params); writes optional moe_load_balancer.yaml.
Server config generator & server start (new signature)
examples/disaggregated/slurm/benchmark/gen_server_config.py, examples/disaggregated/slurm/benchmark/start_server.sh
Added: gen_server_config.py polls work_dir/hostnames/* to assemble server_config.yaml; start_server.sh signature changed to (num_ctx_servers, num_gen_servers, work_dir, script_dir), invokes generator, and starts trtllm-serve -c ${work_dir}/server_config.yaml with extended timeouts.
Worker startup and profiling (updated)
examples/disaggregated/slurm/benchmark/start_worker.sh
Signature changed to (role, instance_id, model_path, port, benchmark_mode, concurrency, enable_pdl, work_dir, nsys_folder); resolves role-based config files (ctx_config.yaml/gen_config.yaml), writes per-role hostname files, launches via trtllm-llmapi-launch trtllm-serve, and optionally enables NSYS profiling for GEN (NSYS outputs to ${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}).
Benchmark runner and log collection (updated)
examples/disaggregated/slurm/benchmark/run_benchmark.sh
Strict modes and tracing; robust wait loop for ${log_path}/server_config.yaml; extract hostname/port and poll /health; do_get_logs(log_path, output_folder) now aggregates output_gen_*.log and output_ctx_*.log into per-index summaries; downloads dataset if needed; wraps benchmark runs with preflight/cleanup.
Documentation
examples/disaggregated/slurm/benchmark/README.md
Updated to document split worker/server config generation, new scripts (gen_worker_config.py, gen_server_config.py, start_server.sh), updated start_worker.sh signature, per-group orchestration, and revised benchmark/run workflow and logs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant SLURM as disaggr_torch.slurm
  participant GEN as GEN workers (srun)
  participant CTX as CTX workers (srun)
  participant START_SRV as start_server.sh
  participant GEN_SRV_CFG as gen_server_config.py
  participant SERVER as trtllm-serve
  participant BENCH as run_benchmark.sh

  SLURM->>SLURM: enumerate nodes, compute ntasks_per_node\nsplit nodes into gen_nodes / ctx_nodes
  SLURM->>GEN: srun -> start_worker.sh "GEN" index ... work_dir
  SLURM->>CTX: srun -> start_worker.sh "CTX" index ... work_dir
  GEN->>GEN: write hostname file (GEN_index)\nlaunch trtllm-serve via llmapi-launch (optional NSYS)
  CTX->>CTX: write hostname file (CTX_index)\nlaunch trtllm-serve via llmapi-launch
  SLURM->>START_SRV: start_server.sh num_ctx num_gen work_dir script_dir
  START_SRV->>GEN_SRV_CFG: poll work_dir/hostnames/*\nassemble server_config.yaml
  GEN_SRV_CFG->>START_SRV: server_config.yaml ready
  START_SRV->>SERVER: trtllm-serve -c server_config.yaml
  SLURM->>BENCH: run_benchmark.sh ${logdir}
  BENCH->>SERVER: wait for /health OK\nrun workload
  BENCH->>BENCH: collect per-worker logs (output_gen_*.log / output_ctx_*.log)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Documentation

Suggested reviewers

  • qiaoxj07
  • chzblych
  • Shixiaowei02
  • chuangz0

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0eac701 and af7dde0.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)

64-78: Health check loop is logically incorrect; it won’t wait on non-200 responses

Using “while ! curl …” only checks curl’s exit status, not the HTTP code. 4xx/5xx will pass immediately. Compare the code to 200 explicitly.

-# check server is health by curl every 10 seconds timeout 1800 seconds
-timeout=1800
-start_time=$(date +%s)
-while ! curl -s -o /dev/null -w "%{http_code}" http://${hostname}:${port}/health; do
-    current_time=$(date +%s)
-    elapsed=$((current_time - start_time))
-    if [ $elapsed -ge $timeout ]; then
-        echo "Error: Server is not healthy after ${timeout} seconds"
-        exit 1
-    fi
-    if [ $((elapsed % 30)) -eq 0 ]; then
-        echo "Waiting for server to be healthy... (${elapsed}s elapsed)"
-    fi
-    sleep 10
-done
+# Check server health every 10s, timeout 1800s
+timeout=1800
+start_time=$(date +%s)
+while true; do
+    http_code=$(curl -s -o /dev/null -w "%{http_code}" "http://${hostname}:${port}/health" || echo "000")
+    if [ "${http_code}" = "200" ]; then
+        break
+    fi
+    current_time=$(date +%s)
+    elapsed=$((current_time - start_time))
+    if [ $elapsed -ge $timeout ]; then
+        echo "Error: Server is not healthy after ${timeout} seconds (last code=${http_code})"
+        exit 1
+    fi
+    if [ $((elapsed % 30)) -eq 0 ]; then
+        echo "Waiting for server to be healthy... (${elapsed}s elapsed, last code=${http_code})"
+    fi
+    sleep 10
+done
examples/disaggregated/slurm/benchmark/gen_yaml.py (2)

7-23: Unify default for cache_transceiver_max_num_tokens with CLI (8448)

Function default is 4608 while CLI default is 8448. Aligning avoids surprises for direct callers.

-                    cache_transceiver_max_num_tokens: int = 4608) -> None:
+                    cache_transceiver_max_num_tokens: int = 8448) -> None:

23-45: Docstring is stale and references removed parameters

Docstring parameters (e.g., config_path, model_path, num_ctx_servers, server_port) no longer exist. Update to the current signature so downstream users don’t misconfigure.

-    """
-    Generate configuration YAML file for disaggregated inference.
-
-    Args:
-        config_path: Path to save the config file
-        model_path: Path to the model
-        num_ctx_servers: Number of context servers
-        ctx_tp_size: Tensor parallel size for context servers
-        ctx_batch_size: Batch size for context servers
-        ctx_max_num_tokens: Max number of tokens for context servers
-        ctx_max_seq_len: Max sequence length for context servers
-        ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers
-        ctx_enable_attention_dp: Enable attention DP for context servers
-        num_gen_servers: Number of generation servers
-        gen_tp_size: Tensor parallel size for generation servers
-        gen_batch_size: Batch size for generation servers
-        gen_max_num_tokens: Max number of tokens for generation servers
-        gen_enable_attention_dp: Enable attention DP for generation servers
-        gen_gpu_memory_fraction: GPU memory fraction for generation servers
-        eplb_num_slots: Number of slots for eplb
-        worker_start_port: Start port for workers
-        server_port: Server port
-    """
+    """
+    Generate ctx/gen YAML configs for disaggregated inference and write them to work_dir.
+
+    Args:
+        work_dir: Output directory for ctx_config.yaml and gen_config.yaml.
+        ctx_tp_size: TP size for context servers.
+        ctx_batch_size: Max batch size for context servers.
+        ctx_max_num_tokens: Max tokens for context servers.
+        ctx_max_seq_len: Max sequence length for context servers.
+        ctx_free_gpu_memory_fraction: KV cache free memory fraction (0..1) for context servers.
+        ctx_enable_attention_dp: Whether to enable Attention DP for context servers.
+        gen_tp_size: TP size for generation servers.
+        gen_batch_size: Max batch size for generation servers.
+        gen_max_num_tokens: Max tokens for generation servers.
+        gen_max_seq_len: Max sequence length for generation servers.
+        gen_enable_attention_dp: Whether to enable Attention DP for generation servers.
+        gen_gpu_memory_fraction: KV cache free memory fraction (0..1) for generation servers.
+        eplb_num_slots: MOE load-balancer slots; if > 0, writes moe_load_balancer.yaml and references it.
+        mtp_size: Number of layers for MTP speculative decoding; if > 0, adds speculative_config to both.
+        cache_transceiver_max_num_tokens: Max tokens in cache transceiver buffer (both roles).
+    """
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)

153-155: Stale reference to config.yaml (now split configs); remove or update

The pipeline now generates ctx_config.yaml/gen_config.yaml (and server_config.yaml later). Reading ${full_logdir}/config.yaml will fail. Unless another step writes this file, drop these lines or point to the correct server_config.yaml when it exists.

Suggested change:

-hostname_value=$(grep '^hostname:' ${full_logdir}/config.yaml | awk -F': ' '{print $2}')
-echo "server host name: $hostname_value"
+# Server hostname is derived from server_config.yaml created by start_server.sh/get_server_config.py.
🧹 Nitpick comments (14)
examples/disaggregated/slurm/benchmark/get_server_config.py (3)

1-7: Add NVIDIA 2025 copyright header

Per repository guidelines, prepend the NVIDIA copyright header to Python sources.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import argparse
 import os
 import socket
 import time

47-47: Fix long line flagged by Ruff (E501)

Break the long f-string into shorter parts.

-        print(
-            f"Waiting for hostnames to be found in {hostnames_folder}, current length: {len(hostnames)}, expected length: {args.num_ctx_servers + args.num_gen_servers}"
-        )
+        print(
+            "Waiting for hostnames to be found in "
+            f"{hostnames_folder}, current length: {len(hostnames)}, "
+            f"expected length: {args.num_ctx_servers + args.num_gen_servers}"
+        )

86-90: Use safe_dump and explicit UTF-8 encoding for YAML output

Safer and yields stable formatting.

-    with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f:
-        yaml.dump(server_config, f)
+    out_path = os.path.join(args.work_dir, "server_config.yaml")
+    with open(out_path, "w", encoding="utf-8") as f:
+        yaml.safe_dump(server_config, f, default_flow_style=False, sort_keys=False)
-    print(
-        f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated"
-    )
+    print(f"Server config file {out_path} generated")
examples/disaggregated/slurm/benchmark/run_benchmark.sh (3)

35-46: Quote paths when waiting for config file to avoid glob/space issues

Minor robustness tweak.

-while [ ! -f ${config_file} ]; do
+while [ ! -f "${config_file}" ]; do

49-56: YAML parsing by grep is brittle; minimally constrain matches

If comments or similarly named keys are added, this can misparse. Consider yq or Python for robust parsing. As a minimal fix, anchor to start and ignore commented lines.

-hostname=$(grep -i "hostname:" ${config_file} | awk '{print $2}')
-port=$(grep -i "port:" ${config_file} | awk '{print $2}')
+hostname=$(grep -iE '^[[:space:]]*hostname:' "${config_file}" | grep -v '^[[:space:]]*#' | awk '{print $2}')
+port=$(grep -iE '^[[:space:]]*port:' "${config_file}" | grep -v '^[[:space:]]*#' | awk '{print $2}')

If you prefer a robust approach without extra dependencies, I can replace this with a tiny python -c snippet that safe_loads the YAML.
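
For example, a minimal sketch along those lines (assumes PyYAML is importable inside the container and that hostname and port are top-level keys in the YAML):

# Sketch: let yaml.safe_load do the parsing instead of grep/awk
hostname=$(python3 -c "import sys, yaml; print(yaml.safe_load(open(sys.argv[1]))['hostname'])" "${config_file}")
port=$(python3 -c "import sys, yaml; print(yaml.safe_load(open(sys.argv[1]))['port'])" "${config_file}")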


127-138: Avoid kill -9 and limit scope to the job when possible

SIGKILL skips cleanup and the grep patterns may hit unrelated processes. Prefer TERM with a timeout, and restrict the scope to the current user or SLURM job.

-kill -9 $(ps aux | grep '[s]tart_server.sh' | awk '{print $2}') >/dev/null 2>&1 || true
-kill -9 $(ps aux | grep '[s]tart_worker.sh' | awk '{print $2}') >/dev/null 2>&1 || true
-kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
-sleep 20  # Give processes some time to clean up
+pkill -TERM -u "${USER}" -f '[s]tart_server.sh' >/dev/null 2>&1 || true
+pkill -TERM -u "${USER}" -f '[s]tart_worker.sh' >/dev/null 2>&1 || true
+pkill -TERM -u "${USER}" -f '[t]rtllm-serve'  >/dev/null 2>&1 || true
+sleep 20  # Give processes some time to clean up
+# Force kill only if still alive
+pgrep -u "${USER}" -f '[t]rtllm-serve' >/dev/null && pkill -KILL -u "${USER}" -f '[t]rtllm-serve' || true
examples/disaggregated/slurm/benchmark/gen_yaml.py (2)

1-4: Add NVIDIA 2025 copyright header

Comply with the project’s source header requirement.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import argparse
 import os

132-139: Ensure work_dir exists and use safe_dump with UTF-8

Create the directory, write YAML using safe_dump, and disable key sorting for readability.

-    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
-    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
-    with open(ctx_config_file, "w") as f:
-        yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False)
-    with open(gen_config_file, "w") as f:
-        yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False)
+    os.makedirs(work_dir, exist_ok=True)
+    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
+    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
+    with open(ctx_config_file, "w", encoding="utf-8") as f:
+        yaml.safe_dump(ctx_config, f, default_flow_style=False, sort_keys=False)
+    with open(gen_config_file, "w", encoding="utf-8") as f:
+        yaml.safe_dump(gen_config, f, default_flow_style=False, sort_keys=False)
examples/disaggregated/slurm/benchmark/start_worker.sh (2)

43-46: Quote variable expansions when writing hostnames

Prevents word splitting and globbing issues; also avoid an unnecessary subshell.

-    mkdir -p ${work_dir}/hostnames/
-    echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt
-    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
+    mkdir -p "${work_dir}/hostnames"
+    hostname > "${work_dir}/hostnames/${role}_${instance_id}.txt"
+    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"

49-68: NSYS prefix handling and quoting

ShellCheck warnings (SC2089/SC2090/SC2046) indicate quoting will be treated literally and word-splitting may occur. Use arrays for NSYS args and quote expansions elsewhere.

-if [ -z "${nsys_folder}" ]; then
+if [ -z "${nsys_folder}" ]; then
     echo "nsys is not enabled, start normal flow"
-    trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file}
+    trtllm-llmapi-launch trtllm-serve "${model_path}" --host "$(hostname)" --port "${port}" --extra_llm_api_options "${config_file}"
 else
     nsys_prefix=""
-    nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}
+    nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}"
     export TLLM_PROFILE_RECORD_GC=1
     export TLLM_NVTX_DEBUG=1
     if [ "${role}" = "GEN" ]; then
         export TLLM_PROFILE_START_STOP=200-250
-        nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none"
-        echo "nsys_prefix: ${nsys_prefix}"
+        # Build NSYS args as an array to preserve quoting:
+        nsys_args=(profile -e "NSYS_MPI_STORE_TEAMS_PER_RANK=1" -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none)
+        echo "nsys_file: ${nsys_file}"
     elif [ "${role}" = "CTX" ]; then
         echo "nsys is not enabled on ctx_gpus"
     fi
-    trtllm-llmapi-launch ${nsys_prefix} \
-        trtllm-serve ${model_path} \
-            --host $(hostname) --port ${port} \
-            --extra_llm_api_options ${config_file}
+    if [ "${role}" = "GEN" ]; then
+        trtllm-llmapi-launch nsys "${nsys_args[@]}" \
+            trtllm-serve "${model_path}" \
+                --host "$(hostname)" --port "${port}" \
+                --extra_llm_api_options "${config_file}"
+    else
+        trtllm-llmapi-launch trtllm-serve "${model_path}" \
+            --host "$(hostname)" --port "${port}" \
+            --extra_llm_api_options "${config_file}"
+    fi
 fi
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

95-96: Typo in log directory label ("tep" vs. "tp") will fragment result paths

The suffix should be consistent; use "tp" for tensor parallel size. Right now line 82 uses "dep" and this branch uses "tep". Pick one ("tp" is conventional) across both cases.

Apply this diff here, and mirror the same rename at Line 82 to keep the directory pattern stable:

-    full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
+    full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tp${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}

158-159: pid_list is collected but unused; add a trap to ensure cleanup of background sruns

Without cleanup, backgrounded sruns may linger if the job aborts. Add a trap to kill any remaining PIDs on EXIT/INT/TERM.

Inject right after pid_list initialization:

 pid_list=""
 
+cleanup() {
+  for pid in $pid_list; do
+    if kill -0 "$pid" >/dev/null 2>&1; then
+      echo "Cleaning up PID $pid"
+      kill "$pid" || true
+    fi
+  done
+}
+trap cleanup EXIT INT TERM

121-126: Install step only runs in the base container; per-worker/server containers may miss TRT-LLM deps

You install with --container-name=${container_name} ("disaggr"), but workers/servers run in distinct containers (container_name_ctx_, container_name_gen_, container_name_server). Unless the image is pre-baked, those containers won’t see the editable install.

Options:

  • Pre-bake the image with TRT-LLM and remove the on-cluster install.
  • Or replicate the install for each container type before launching workers/servers, e.g.:
# Example: install in all upcoming container names (adjust counts accordingly)
for i in $(seq 0 $((num_ctx_servers - 1))); do
  srun --container-name=${container_name}_ctx_${i} --container-image=${container_image} \
       --container-mounts=${mounts} --mpi=pmix --overlap -N 1 -n 1 \
       bash -c "cd ${trtllm_repo} && pip install -e ."
done
for i in $(seq 0 $((num_gen_servers - 1))); do
  srun --container-name=${container_name}_gen_${i} --container-image=${container_image} \
       --container-mounts=${mounts} --mpi=pmix --overlap -N 1 -n 1 \
       bash -c "cd ${trtllm_repo} && pip install -e ."
done
srun --container-name=${container_name}_server --container-image=${container_image} \
     --container-mounts=${mounts} --mpi=pmix --overlap -N 1 -n 1 \
     bash -c "cd ${trtllm_repo} && pip install -e ."

128-149: gen_yaml.py flags are correct; add file existence checks

I confirmed via rg that gen_yaml.py defines both --ctx_free_gpu_memory_fraction (line 166) and --gen_gpu_memory_fraction (line 194), matching the flags passed in the SLURM script—no renaming needed.

Next, to fail fast if the YAML artifacts aren’t generated, add a post-generation check. For example, immediately after the srun in examples/disaggregated/slurm/benchmark/disaggr_torch.slurm, insert:

# Verify that both config files were created
if [[ ! -f "${full_logdir}/ctx_config.yaml" || ! -f "${full_logdir}/gen_config.yaml" ]]; then
  echo "ERROR: Missing ctx/gen YAMLs in ${full_logdir}" >&2
  exit 1
fi

This ensures that any failure in gen_yaml.py is caught early rather than silently proceeding.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c7269ea and 6fd6704.

📒 Files selected for processing (6)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3 hunks)
  • examples/disaggregated/slurm/benchmark/gen_yaml.py (5 hunks)
  • examples/disaggregated/slurm/benchmark/get_server_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/run_benchmark.sh (3 hunks)
  • examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
  • examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • examples/disaggregated/slurm/benchmark/get_server_config.py
  • examples/disaggregated/slurm/benchmark/gen_yaml.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • examples/disaggregated/slurm/benchmark/get_server_config.py
  • examples/disaggregated/slurm/benchmark/gen_yaml.py
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/get_server_config.py

47-47: Line too long (168 > 120)

(E501)

🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh

[warning] 44-44: Quote this to prevent word splitting.

(SC2046)


[warning] 51-51: Quote this to prevent word splitting.

(SC2046)


[warning] 59-59: Quotes/backslashes will be treated literally. Use an array.

(SC2089)


[warning] 64-64: Quotes/backslashes in this variable will not be respected.

(SC2090)


[warning] 66-66: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
examples/disaggregated/slurm/benchmark/get_server_config.py (1)

68-85: Confirm schema fields match trtllm-serve expectations

The keys 'backend', 'context_servers/generation_servers.num_instances', and 'urls' should match disaggregated server schema. Please confirm that:

  • backend: 'pytorch' is valid here
  • urls expect "host:port" without protocol
  • num_instances must equal len(urls)

If the schema differs, the server may fail to load.

Would you like me to add a quick YAML schema check in CI to validate the generated file before server launch?
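
If helpful, a rough sketch of such a pre-launch check could look like this (key names follow the list above and are assumptions; adjust to the actual schema):

# Sketch: validate server_config.yaml before launching trtllm-serve
python3 - "${work_dir}/server_config.yaml" <<'PY'
import sys, yaml

cfg = yaml.safe_load(open(sys.argv[1]))
assert cfg.get("backend") == "pytorch", "unexpected backend"
for section in ("context_servers", "generation_servers"):
    srv = cfg[section]
    assert srv["num_instances"] == len(srv["urls"]), f"{section}: num_instances != len(urls)"
    for url in srv["urls"]:
        assert "://" not in url, f"{section}: urls should be host:port without a protocol"
print("server_config.yaml passed the basic consistency checks")
PY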

examples/disaggregated/slurm/benchmark/gen_yaml.py (2)

105-107: Confirm 'MNNVL' allreduce strategy identifier

The string "MNNVL" may be a specific strategy code. Please verify the exact spelling and casing match the runtime’s expected values.

If needed, I can auto-scan for known allreduce strategies in the codebase with a follow-up script.
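
As a starting point, a scan along these lines would surface the accepted spellings (path and identifier are guesses, not verified):

# Sketch: look for where the MNNVL strategy string is defined/consumed
rg -n "MNNVL" tensorrt_llm/ | head -n 20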


71-76: All MOE backend strings are valid

I’ve confirmed that the three values used in examples/disaggregated/slurm/benchmark/gen_yaml.py
"CUTLASS", "WIDEEP", and "TRTLLM"
– are part of the allowed set in the runtime’s configuration:

• In tensorrt_llm/llmapi/llm_args.py, the MoeConfig.backend field is defined as
Literal["CUTLASS", "CUTEDSL", "WIDEEP", "TRTLLM", "DEEPGEMM", "VANILLA", "TRITON"]
confirming that each of the three strings is accepted .

No changes are required in the snippet; the strings align exactly with the runtime’s enum.

@kaiyux kaiyux changed the title from "Make disagg example compatible with recommended usage" to "[None] [chore] Make disagg example compatible with recommended usage" on Aug 21, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5)

146-146: mtp_size now used in YAML generation and log key; earlier concern addressed

mtp_size is passed to gen_yaml.py and included in the logdir pattern, so the previous “unused” remark is resolved.


111-113: nsys variable mismatch: define nsys_folder (workers consume it), not nsys_on

Workers are invoked with ${nsys_folder}, but the script only defines nsys_on; profiling path will be empty and silently misconfigured. Standardize on nsys_folder and (optionally) create the directory when enabled.

Apply:

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; workers will write profiles here
+# [ -n "${nsys_folder}" ] && mkdir -p "${nsys_folder}"

158-170: CTX worker launch: undefined vars, invalid --segment, and path/name mismatches

Problems:

  • ctx_nodes_num and gpus_per_node are not defined anywhere.
  • --segment is not a valid srun option.
  • Uses ${work_dir} (undefined) instead of ${workdir}.
  • Passes ${model_path} (undefined) instead of ${model_dir}.
  • Depends on ${nsys_folder} which is currently not defined (see earlier comment).

Minimal safe fix (single node per CTX group; drop invalid/undefined options; fix var names):

 for i in $(seq 0 $((num_ctx_servers - 1))); do
-    srun -l -N ${ctx_nodes_num} \
-        --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${ctx_nodes_num} \
-        --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        bash ${work_dir}/start_worker.sh "CTX" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
-            &> ${full_logdir}/output_ctx_${i}.log &
+    srun -l -N 1 \
+        --ntasks=${ctx_tp_size} \
+        --container-image=${container_image} \
+        --container-name=${container_name} \
+        --container-mounts=${mounts} \
+        --mpi=pmix \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
+            &> ${full_logdir}/output_ctx_${i}.log &
     pid_list="${pid_list} $!"
 done

If you intend true multi-node per CTX group, define per-node GPU capacity and derive ctx_nodes_num before the loop:

# Place near other derived vars:
gpus_per_node="${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-1}}"
ctx_nodes_num=$(( (ctx_tp_size + gpus_per_node - 1) / gpus_per_node ))

Then use: -N "${ctx_nodes_num}" --ntasks="${ctx_tp_size}" --ntasks-per-node="${gpus_per_node}", and keep --segment removed.

Quick check:

#!/bin/bash
rg -n 'ctx_nodes_num|gpus_per_node|work_dir\b|model_path\b' examples/disaggregated/slurm/benchmark/disaggr_torch.slurm

173-185: GEN worker launch: mirror the CTX fixes; same undefined vars/--segment/path issues

  • gen_nodes_num and gpus_per_node are not defined.
  • --segment is invalid.
  • ${model_path} is undefined; use ${model_dir}.

Minimal safe fix:

 for i in $(seq 0 $((num_gen_servers - 1))); do
-    srun -l -N ${gen_nodes_num} \
-        --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${gen_nodes_num} \
-        --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
-            &> ${full_logdir}/output_gen_${i}.log &
+    srun -l -N 1 \
+        --ntasks=${gen_tp_size} \
+        --container-image=${container_image} \
+        --container-name=${container_name} \
+        --container-mounts=${mounts} \
+        --mpi=pmix \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
+            &> ${full_logdir}/output_gen_${i}.log &
     pid_list="${pid_list} $!"
 done

If multi-node GEN groups are desired, compute:

gpus_per_node="${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-1}}"
gen_nodes_num=$(( (gen_tp_size + gpus_per_node - 1) / gpus_per_node ))

Then use -N "${gen_nodes_num}" --ntasks="${gen_tp_size}" --ntasks-per-node="${gpus_per_node}" and omit --segment.


189-193: Server start: work_dir vs workdir mismatch (undefined var will break server start)

The final srun passes ${work_dir}, but only ${workdir} is defined elsewhere. Fix the argument to pass the correct path. Also, consider not mixing --container-image and --container-name in the same step unless you intend to start a fresh container for the server step.

Mandatory fix:

 srun -l --container-name=${container_name} \
-    --container-image=${container_image} \
     --container-mounts=${mounts} \
     --mpi=pmix --overlap -N 1 -n 1 \
-    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${work_dir} \
+    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${workdir} \
         &> ${full_logdir}/output_server.log &

Optional cleanup: If you want to reuse the already named container, you can drop --container-image here to avoid starting a new container instance for the server step.

🧹 Nitpick comments (2)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)

1-9: Make the script fail-fast and clean up reliably

Add strict mode and a trap to terminate background worker/server steps if any step fails; this prevents hung jobs and orphaned processes.

Apply near the top (after shebang):

 #!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'
+
+# Ensure background tasks are cleaned up on exit
+cleanup() {
+  srun -l --container-name=${container_name} --container-mounts=${mounts} --mpi=pmix --overlap \
+    kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
+}
+trap cleanup EXIT

Optionally add --kill-on-bad-exit=1 to critical srun steps (YAML gen, workers, server, benchmark) so failures propagate.


128-150: Optional: Unify GPU memory‐fraction flag names for clarity

I verified that in examples/disaggregated/slurm/benchmark/gen_yaml.py you’ve defined:

  • --ctx_free_gpu_memory_fraction (default 0.75)
  • --gen_gpu_memory_fraction (default 0.8)

and your SLURM script correctly invokes those exact flags. However, the asymmetry—one flag containing “free” and the other not—can be confusing. Consider a one-time refactor to align them:

• Option A: Rename the context flag to --ctx_gpu_memory_fraction
• Option B: Rename the generation flag to --gen_free_gpu_memory_fraction

Then update both the add_argument calls in gen_yaml.py and the corresponding --ctx_*/--gen_* invocations in disaggr_torch.slurm so they match.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6fd6704 and af0e2f1.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3 hunks)
🔇 Additional comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)

79-79: Container name standardization LGTM

Renaming to disaggregated_serving improves clarity and aligns with the split server/worker flow. No action needed.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)

1-1: Add strict bash options to fail fast and preserve upstream exit codes

Pipelines like "python3 … | tee …" will otherwise hide failures because tee’s status wins. Enable strict mode right after the shebang.

 #!/bin/bash
+set -Eeuo pipefail
♻️ Duplicate comments (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6)

111-113: nsys variable mismatch — standardize on nsys_folder (path), not nsys_on (flag)

The worker launch passes a path-like last argument; keep the name semantic. Align the variable and comment accordingly. This also addresses a prior review.

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment this line to enable Nsys profiling; profiles will be written here

128-150: YAML generation: good structure; verify arg names and capture errors

  • The call is clear and well-logged. Two asks:
    • Confirm gen_yaml.py expects --ctx_free_gpu_memory_fraction (ctx) but --gen_gpu_memory_fraction (gen). If both sides expect “free” or “gpu”, make the naming consistent.
    • With set -o pipefail added, python failures will now propagate despite tee, which is what we want.

Also, answering an earlier question: mtp_size is used both in the logdir naming and here as --mtp_size, so it is not dead.


177-193: GEN worker srun: invalid flag, and undefined/mismatched variables in command

  • --segment is not a valid srun option.
  • work_dir/model_path/nsys_on are undefined or mismatched with earlier variables (workdir/model_dir/nsys_folder).
  • Specifying both --ntasks and --ntasks-per-node can overconstrain placement when gen_tp_size is not a multiple of gpus_per_node.

Minimal, robust fix:

 srun -l -N ${gen_nodes_num} \
-        --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${gen_nodes_num} \
+        --ntasks=${gen_tp_size} \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
         --nodelist=$(IFS=,; echo "${node_list[*]}") \
         --mpi=pmix \
-        bash ${work_dir}/start_worker.sh "GEN" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
         &> ${full_logdir}/output_gen_${i}.log &

Optionally, if you must constrain per-node tasks, compute a remainder-aware layout and launch per-node sruns; otherwise, rely on Slurm to distribute -n across -N.


195-213: CTX worker srun: same issues as GEN block

Mirror the GEN fixes: drop --segment, align variable names, and avoid overconstraining placement unless you ensure divisibility.

 srun -l -N ${ctx_nodes_num} \
-        --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${ctx_nodes_num} \
+        --ntasks=${ctx_tp_size} \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
         --nodelist=$(IFS=,; echo "${node_list[*]}") \
         --mpi=pmix \
-        bash ${work_dir}/start_worker.sh "CTX" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
             &> ${full_logdir}/output_ctx_${i}.log &

218-223: Server start: work_dir vs workdir mismatch breaks argv to start_server.sh

The last argument is ${work_dir}, which is undefined. Use ${workdir} to keep consistency. This was flagged previously as well.

 srun -l --container-name=${container_name} \
     --container-image=${container_image} \
     --container-mounts=${mounts} \
     --mpi=pmix --overlap -N 1 -n 1 \
-    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${work_dir} \
+    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${workdir} \
         &> ${full_logdir}/output_server.log &

154-156: gpus_per_node is undefined; node count math will fail

You compute ctx_nodes_num/gen_nodes_num using gpus_per_node, but gpus_per_node is not set anywhere. Parse it from Slurm env (SLURM_TASKS_PER_NODE/SLURM_NTASKS_PER_NODE) with sane fallbacks.

-ctx_nodes_num=$(((ctx_tp_size + gpus_per_node - 1) / gpus_per_node))
-gen_nodes_num=$(((gen_tp_size + gpus_per_node - 1) / gpus_per_node))
+#
+# Derive tasks-per-node (used here as GPUs-per-node) from Slurm env.
+# Handles forms like "4(x2),3" by taking the first numeric group.
+gpus_per_node="${SLURM_TASKS_PER_NODE:-${SLURM_NTASKS_PER_NODE:-}}"
+gpus_per_node="${gpus_per_node%%,*}"       # take first segment
+gpus_per_node="${gpus_per_node%%(*}"       # strip (xN)
+if ! [[ "${gpus_per_node}" =~ ^[0-9]+$ ]]; then gpus_per_node=1; fi
+echo "gpus_per_node: ${gpus_per_node}"
+
+ctx_nodes_num=$(((ctx_tp_size + gpus_per_node - 1) / gpus_per_node))
+gen_nodes_num=$(((gen_tp_size + gpus_per_node - 1) / gpus_per_node))
🧹 Nitpick comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

79-79: Container name reuse across steps — verify plugin semantics

Reusing a single container_name for GEN/CTX/server steps is fine if your Slurm container plugin (e.g., pyxis/enroot) guarantees safe reuse and isolation per job step. If reuse is not guaranteed, consider per-role names (e.g., disaggregated_serving_{gen,ctx,server}) to avoid cross-step side effects.
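
A minimal sketch of per-role naming, purely illustrative:

# Hypothetical per-role container names to avoid cross-step reuse
container_name_gen="${container_name}_gen"
container_name_ctx="${container_name}_ctx"
container_name_server="${container_name}_server"
# then pass --container-name=${container_name_gen} (etc.) in the corresponding srun steps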


173-176: Leftover hostnames cleanup without producer

You remove ${full_logdir}/hostnames but the script never generates it anymore. Safe to keep, but consider dropping the cleanup to avoid confusion.


214-214: pid_list is accumulated but never used

If you don’t intend to wait on or manage the background pids explicitly, drop the echo and the accumulation to reduce noise. Alternatively, add a final wait on those to ensure orderly shutdown.
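
For example, a final wait could look like this (sketch; assumes pid_list holds the PIDs of the backgrounded srun steps):

# Wait for all backgrounded worker/server steps so the job shuts down in order
for pid in ${pid_list}; do
    wait "${pid}" || echo "background step with PID ${pid} exited non-zero"
done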


82-82: Minor: standardize logdir token for tensor parallel size

You use "dep" in one path and "tep" in the other; most places use "tp". Standardize to "tp" for grep-ability.

Example (not an exact diff here as these lines weren’t changed in this patch):

  • ctx…_gen…_dep${gen_tp_size}… → ctx…_gen…_tp${gen_tp_size}…
  • ctx…_gen…_tep${gen_tp_size}… → ctx…_gen…_tp${gen_tp_size}…

Also applies to: 95-95

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between af0e2f1 and 1bbbead.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

113-115: Fix Nsys variable mismatch; standardize on nsys_folder and pass it to workers.

Workers should receive a folder path; current var name (nsys_on) is ambiguous and inconsistent with earlier guidance.

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys  # Uncomment to enable Nsys profiling; profiles will be saved here
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

Also applies to: 192-193, 211-213


156-171: Bound ctx_nodes slice and add capacity guard to prevent over-requesting nodes.

Current slicing takes “the rest” of all nodes; also no check against allocation size.

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
 
 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"

183-193: Remove invalid srun option --segment in GEN worker launch.

--segment is not a valid srun flag; leaving it will cause runtime failures.

 srun -l -N ${gen_nodes_num} \
         --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${gen_nodes_num} \
+        --ntasks-per-node=${gpus_per_node} \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
         --nodelist=$(IFS=,; echo "${node_list[*]}") \
         --mpi=pmix \

202-213: Remove invalid srun option --segment in CTX worker launch.

Same issue as GEN block; this will fail at runtime.

 srun -l -N ${ctx_nodes_num} \
         --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${ctx_nodes_num} \
+        --ntasks-per-node=${gpus_per_node} \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
         --nodelist=$(IFS=,; echo "${node_list[*]}") \
         --mpi=pmix \
🧹 Nitpick comments (5)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5)

45-45: Don’t hardcode gpus_per_node; derive from SLURM env (with sane fallback).

Hardcoding 4 risks mismatch with the actual allocation and leads to wrong node math.

-gpus_per_node=4
+# Derive GPUs/tasks per node from SLURM env (overrideable), default to 4 if unknown.
+gpus_per_node="${GPUS_PER_NODE_OVERRIDE:-${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-4}}}"
+echo "gpus_per_node: ${gpus_per_node}"

Run-time check (optional): validate that ctx/gen nodes computed below match $SLURM_NNODES and fail early if not (see capacity guard added in a later comment).


130-152: Pipeline safety: tee masks failures — enable pipefail and fail-fast.

Without pipefail, a failing python in the YAML step won’t stop the job.

 #!/bin/bash
+set -euo pipefail

(With this in place, the gen_yaml.py | tee step will correctly propagate non‑zero exit codes.)

Also applies to: 1-1


183-193: Optional: avoid over-constraining distribution; let Slurm place tasks.

If tp_size isn’t a multiple of gpus_per_node, --ntasks-per-node=${gpus_per_node} can over-constrain placement. It’s safer to omit it and rely on --ntasks with --nodelist.

-        --ntasks-per-node=${gpus_per_node} \

If you need 1 task per GPU, prefer --gpus-per-task=1 (if GPUs are scheduled) and drop --ntasks-per-node.
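
For instance, a GEN launch shaped roughly like this (sketch only; assumes GPUs are scheduled by Slurm for this job):

# Let Slurm place the tasks; request one GPU per task instead of pinning tasks per node
srun -l -N ${gen_nodes_num} --ntasks=${gen_tp_size} --gpus-per-task=1 \
    --container-image=${container_image} --container-name=${container_name} \
    --container-mounts=${mounts} --mpi=pmix \
    bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
        &> ${full_logdir}/output_gen_${i}.log &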

Also applies to: 202-213


116-122: Container warm-up step is redundant or too narrow.

This creates a container on a single node only; subsequent steps will (re)create containers where needed. Either remove it or scale it to all nodes.

Option A (remove):

-srun -l --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        echo "Container up."

Option B (warm up on all nodes):

 srun -l --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
-        --mpi=pmix \
-        echo "Container up."
+        --mpi=pmix -N "${SLURM_NNODES}" --ntasks-per-node=1 \
+        bash -lc 'echo "Container up on $(hostname)"'

232-237: Kill step likely runs on one node only; scale it across all nodes.

Ensure the cleanup runs everywhere to avoid orphaned servers/workers.

-srun -l --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix --overlap \
-        kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
+srun -l --container-name=${container_name} \
+        --container-mounts=${mounts} \
+        --mpi=pmix --overlap -N "${SLURM_NNODES}" --ntasks-per-node=1 \
+        bash -lc "pkill -9 -f '[t]rtllm-serve' || true"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1bbbead and 81ecc14.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)

218-225: LGTM: server start uses consistent workdir and container options.

The earlier work_dir/workdir mismatch appears resolved, and adding container-image/mounts for the server step is correct.


2-4: Verify total resources requested by steps don’t exceed SBATCH allocation.

Given SBATCH has 2 nodes and 8 tasks, confirm that:

  • total_gen_nodes_num + total_ctx_nodes_num <= $SLURM_NNODES
  • Sum of concurrent worker step tasks does not exceed available CPUs/GPUs per node

If needed, add a guard before launching workers:

# After computing gen_nodes_num/ctx_nodes_num and slices:
if (( total_gen_nodes_num + total_ctx_nodes_num != SLURM_NNODES )); then
  echo "Warning: Node slices (${total_gen_nodes_num}+${total_ctx_nodes_num}) differ from allocated nodes (${SLURM_NNODES})."
fi

Also applies to: 156-171, 177-215

@kaiyux kaiyux marked this pull request as ready for review August 22, 2025 07:56
@kaiyux kaiyux requested review from a team as code owners August 22, 2025 07:56
@kaiyux kaiyux requested review from chuangz0 and chzblych August 22, 2025 07:56
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (2)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)

115-117: Prefer an explicit nsys_folder and precreate it; current name suggests boolean

The variable name nsys_on reads as a boolean toggle but is passed as a path to workers. Use nsys_folder for clarity and create the directory when enabled. This aligns with earlier guidance.

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys  # Uncomment to enable Nsys profiling
+if [[ -n "${nsys_folder}" ]]; then
+  mkdir -p "${nsys_folder}"
+fi

And pass nsys_folder to workers:

-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

158-174: Node slicing: add capacity guard and bound CTX slice to requested size

Currently ctx_nodes is assigned the remainder of all_nodes; if allocation > required, CTX will grab extra nodes; if allocation < required, you won’t fail early. Add a capacity check and slice exactly total_ctx_nodes_num.

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
@@
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
@@
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
🧹 Nitpick comments (17)
examples/disaggregated/slurm/benchmark/gen_server_config.py (4)

32-35: Optional: create work_dir if missing instead of raising

Creating the directory (with exist_ok=True) reduces flakiness in orchestrations where the server side may race with the producer of the folder.

-    if not os.path.exists(args.work_dir):
-        raise ValueError(f"Work directory {args.work_dir} not found")
+    if not os.path.exists(args.work_dir):
+        os.makedirs(args.work_dir, exist_ok=True)
+        logger.info("Created work directory %s", args.work_dir)

65-66: Prefer logging over prints for host lists

Move these to INFO so they’re visible in SLURM output but can be silenced via --log_level.

-    print(f"ctx_hostnames: {ctx_hostnames}")
-    print(f"gen_hostnames: {gen_hostnames}")
+    logger.info("ctx_hostnames: %s", ctx_hostnames)
+    logger.info("gen_hostnames: %s", gen_hostnames)

68-71: Nit: comment correctness and structured logging

You get the hostname from the system, not the env. Also prefer logging over print for consistency.

-    # get current hostname from env
+    # Get current hostname from the system
     hostname = socket.gethostname()
-    print(f"Current hostname: {hostname}")
+    logger.info("Current hostname: %s", hostname)

86-90: Use safe YAML dump, explicit encoding, and a single log line

Safer dumper, stable key order, and consistent logging.

-    with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f:
-        yaml.dump(server_config, f)
-    print(
-        f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated"
-    )
+    out_path = os.path.join(args.work_dir, "server_config.yaml")
+    with open(out_path, "w", encoding="utf-8") as f:
+        yaml.safe_dump(server_config, f, sort_keys=False)
+    logger.info("Server config file %s generated", out_path)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5)

95-100: Echo line reports only gen fraction; include ctx to avoid confusion

The log prints enable_attention_dp for both CTX/GEN but only reports gen_gpu_memory_fraction. Consider echoing ctx_free_gpu_memory_fraction as well for symmetry.

-echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"
+echo "enable_attention_dp: CTX=${ctx_enable_attention_dp}, GEN=${gen_enable_attention_dp}, "\
+"gpu_memory_fraction: CTX=${ctx_gpu_memory_fraction}, GEN=${gen_gpu_memory_fraction}"

132-155: Fail fast and surface config generation errors

The gen_worker_config.py stage is critical. Without set -euo pipefail, downstream srun steps may proceed with a partial or missing config. Add strict mode and a trap early; also check that expected artifacts are present.

 #!/bin/bash
+# Fail fast and surface line numbers on error
+set -euo pipefail
+trap 'echo "[ERROR] $(basename "$0") failed at line ${LINENO}" >&2' ERR
@@
 srun -l -N 1 -n 1 \
@@
                 2>&1 | tee ${full_logdir}/gen_worker_config.log
+
+# Optional: verify expected outputs exist before continuing
+if [[ ! -s "${full_logdir}/ctx_config.yaml" || ! -s "${full_logdir}/gen_config.yaml" ]]; then
+  echo "Missing worker configs under ${full_logdir}. See gen_worker_config.log." >&2
+  exit 1
+fi

229-233: Optional: isolate benchmark resource use

The benchmark srun uses the shared container and may contend with servers for CPU cores. Consider pinning CPUs or using --hint/--cpus-per-task if your cluster enforces CPU binding to avoid noisy-neighbor effects in results.

Example:

-        --mpi=pmix --overlap -N 1 -n 1 \
+        --mpi=pmix --overlap -N 1 -n 1 --cpus-per-task=4 \

234-240: Shutdown sequence OK; consider a gentler stop before SIGKILL

kill -9 is effective but skips cleanup. If start_server.sh or workers trap SIGTERM, a brief graceful phase can improve log flush and cleanup.

-        kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
+        pids=$(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') || true
+        [[ -n "${pids}" ]] && kill ${pids} >/dev/null 2>&1 || true
+        sleep 2
+        [[ -n "${pids}" ]] && kill -9 ${pids} >/dev/null 2>&1 || true

45-47: Optional Refactor: Robust SLURM tasks-per-node parsing

  • In a non-Slurm environment both SLURM_NTASKS_PER_NODE and SLURM_TASKS_PER_NODE are unset (empty), so the proposed snippet correctly falls back to the default of 4.
  • The sed -E 's/^([0-9]+).*/\1/' will extract “4” from strings like 4(x2),2, covering heterogeneous allocations.
  • Applying this diff will make ntasks_per_node derivation resilient across Slurm formats while preserving your existing default:
-# Get GPUs per node dynamically from SLURM
-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4}  # Default to 4 for GB200
+# Get tasks-per-node (used as "gpus per node") robustly; fallback to 4
+ntasks_per_node="${SLURM_NTASKS_PER_NODE:-}"
+if [[ -z "${ntasks_per_node}" ]]; then
+  # SLURM_TASKS_PER_NODE can look like "4(x2),2"; take the first integer
+  ntasks_per_node="$(echo "${SLURM_TASKS_PER_NODE:-}" | sed -E 's/^([0-9]+).*/\1/')" || true
+fi
+ntasks_per_node="${ntasks_per_node:-4}"
examples/disaggregated/slurm/benchmark/gen_worker_config.py (8)

67-69: Deduplicate and normalize CUDA graph batch sizes.

If gen_batch_size equals one of the preset sizes (e.g., 256), the list contains duplicates. Some readers validate uniqueness.

Apply:

-    gen_cuda_graph_batch_sizes = [
-        1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size
-    ]
+    gen_cuda_graph_batch_sizes = sorted({
+        1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size
+    })

71-76: Streamline MOE backend selection to mutually exclusive branches.

Current code is correct but non-idiomatic. Using if/elif/else clarifies precedence.

-    gen_moe_backend = "CUTLASS"
-    if gen_tp_size >= 16 and gen_enable_attention_dp:
-        gen_moe_backend = "WIDEEP"
-    if not gen_enable_attention_dp:
-        gen_moe_backend = "TRTLLM"
+    if not gen_enable_attention_dp:
+        gen_moe_backend = "TRTLLM"
+    elif gen_tp_size >= 16:
+        gen_moe_backend = "WIDEEP"
+    else:
+        gen_moe_backend = "CUTLASS"

52-52: Redundant bool conversions.

True if <bool> else False is verbose; prefer bool(...).

-        'enable_attention_dp': True if ctx_enable_attention_dp else False,
+        'enable_attention_dp': bool(ctx_enable_attention_dp),
-        'enable_attention_dp': True if gen_enable_attention_dp else False,
+        'enable_attention_dp': bool(gen_enable_attention_dp),

Also applies to: 80-80


57-60: Validate kv_cache_config dtype 'fp8' and consider parameterizing.

Ensure 'fp8' is an accepted dtype for KV cache in current TRT-LLM. If not universally supported, consider exposing as a CLI arg with a safe default.

If parameterizing:

-            'dtype': 'fp8',
+            'dtype': os.environ.get('TRTLLM_KV_DTYPE', 'fp8'),

Or add a --kv_cache_dtype CLI argument used in both configs.

Also applies to: 90-94
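
A minimal sketch of the CLI variant (the --kv_cache_dtype flag does not exist today; the name and default are illustrative):

    parser.add_argument("--kv_cache_dtype",
                        type=str,
                        default="fp8",
                        help="KV cache dtype for both context and generation configs")

and then, in both config dicts:

            'dtype': args.kv_cache_dtype,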


132-141: Avoid shadowing the function name with a local variable and improve naming.

Inside the function, gen_config_file (str path) shadows the function identifier. Rename to *_path for clarity.

-    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
-    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
-    with open(ctx_config_file, "w") as f:
+    ctx_config_path = os.path.join(work_dir, "ctx_config.yaml")
+    gen_config_path = os.path.join(work_dir, "gen_config.yaml")
+    with open(ctx_config_path, "w") as f:
         yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False)
-    with open(gen_config_file, "w") as f:
+    with open(gen_config_path, "w") as f:
         yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False)
 
-    print(
-        f"ctx_config_file: {ctx_config_file} gen_config_file: {gen_config_file} generated successfully"
-    )
+    print(f"ctx_config_file: {ctx_config_path} gen_config_file: {gen_config_path} generated successfully")

147-211: Add basic CLI validation for positive ints and (0,1] fractions.

Guard against invalid inputs early to prevent obscure runtime behavior.

Minimal change: custom argparse type validators and reuse for relevant args.

 import argparse
 import os
 
 import yaml
 
 
+def _fraction01(value: str) -> float:
+    try:
+        x = float(value)
+    except ValueError as e:
+        raise argparse.ArgumentTypeError(f"Invalid float: {value}") from e
+    if not (0.0 < x <= 1.0):
+        raise argparse.ArgumentTypeError(f"Value must be in (0, 1], got {x}")
+    return x
+
+def _pos_int(value: str) -> int:
+    try:
+        x = int(value)
+    except ValueError as e:
+        raise argparse.ArgumentTypeError(f"Invalid int: {value}") from e
+    if x <= 0:
+        raise argparse.ArgumentTypeError(f"Value must be > 0, got {x}")
+    return x
+
     parser = argparse.ArgumentParser()
@@
-    parser.add_argument("--ctx_batch_size",
-                        type=int,
+    parser.add_argument("--ctx_batch_size",
+                        type=_pos_int,
                         default=1,
                         help="Batch size for context servers")
@@
-    parser.add_argument("--ctx_max_num_tokens",
-                        type=int,
+    parser.add_argument("--ctx_max_num_tokens",
+                        type=_pos_int,
                         default=8192,
                         help="Max number of tokens for context servers")
@@
-    parser.add_argument("--ctx_max_seq_len",
-                        type=int,
+    parser.add_argument("--ctx_max_seq_len",
+                        type=_pos_int,
                         default=8192,
                         help="Max sequence length for context servers")
@@
-    parser.add_argument("--ctx_free_gpu_memory_fraction",
-                        type=float,
+    parser.add_argument("--ctx_free_gpu_memory_fraction",
+                        type=_fraction01,
                         default=0.75,
                         help="Free GPU memory fraction for context servers")
@@
-    parser.add_argument("--gen_batch_size",
-                        type=int,
+    parser.add_argument("--gen_batch_size",
+                        type=_pos_int,
                         default=256,
                         help="Batch size for generation servers")
@@
-    parser.add_argument("--gen_max_num_tokens",
-                        type=int,
+    parser.add_argument("--gen_max_num_tokens",
+                        type=_pos_int,
                         default=256,
                         help="Max number of tokens for generation servers")
@@
-    parser.add_argument("--gen_max_seq_len",
-                        type=int,
+    parser.add_argument("--gen_max_seq_len",
+                        type=_pos_int,
                         default=9216,
                         help="Max sequence length for generation servers")
@@
-    parser.add_argument("--gen_gpu_memory_fraction",
-                        type=float,
+    parser.add_argument("--gen_gpu_memory_fraction",
+                        type=_fraction01,
                         default=0.8,
                         help="GPU memory fraction for generation servers")
@@
-    parser.add_argument("--eplb_num_slots",
-                        type=int,
+    parser.add_argument("--eplb_num_slots",
+                        type=int,
                         default=0,
                         help="Number of slots for eplb")
@@
-    parser.add_argument("--mtp_size",
-                        type=int,
+    parser.add_argument("--mtp_size",
+                        type=int,
                         default=0,
                         help="Number of nextn layers for MTP")
@@
-    parser.add_argument("--cache_transceiver_max_num_tokens",
-                        type=int,
+    parser.add_argument("--cache_transceiver_max_num_tokens",
+                        type=_pos_int,
                         default=8448,
                         help="Max number of tokens for cache transceiver")

135-137: Prefer yaml.safe_dump for emitting plain data.

Safer default and consistent with best practices when not relying on Python object tags.

-        yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False)
+        yaml.safe_dump(ctx_config, f, default_flow_style=False, sort_keys=False)
@@
-        yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False)
+        yaml.safe_dump(gen_config, f, default_flow_style=False, sort_keys=False)

139-141: Consider returning output paths instead of printing.

Returning (ctx_config_path, gen_config_path) makes this utility easier to test and compose; the CLI can still print them.

I can wire this through and update start_worker.sh accordingly if you’d like.
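
A minimal sketch of that shape (the parameter list is abbreviated here; the real function builds ctx_config/gen_config from its arguments, and os/yaml are already imported):

    def gen_config_file(work_dir, ctx_config, gen_config):
        # Write both worker YAMLs and return their paths so callers/tests can compose.
        ctx_config_path = os.path.join(work_dir, "ctx_config.yaml")
        gen_config_path = os.path.join(work_dir, "gen_config.yaml")
        with open(ctx_config_path, "w") as f:
            yaml.safe_dump(ctx_config, f, default_flow_style=False, sort_keys=False)
        with open(gen_config_path, "w") as f:
            yaml.safe_dump(gen_config, f, default_flow_style=False, sort_keys=False)
        return ctx_config_path, gen_config_path

    # The CLI keeps the same user-visible message:
    ctx_path, gen_path = gen_config_file(args.work_dir, ctx_config, gen_config)
    print(f"ctx_config_file: {ctx_path} gen_config_file: {gen_path} generated successfully")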

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 81ecc14 and fb674c8.

📒 Files selected for processing (4)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5 hunks)
  • examples/disaggregated/slurm/benchmark/gen_server_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/disaggregated/slurm/benchmark/start_server.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • examples/disaggregated/slurm/benchmark/gen_server_config.py
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • examples/disaggregated/slurm/benchmark/gen_server_config.py
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/gen_server_config.py

47-47: Line too long (168 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
examples/disaggregated/slurm/benchmark/gen_server_config.py (2)

72-84: Server config structure looks good

The schema is straightforward and maps cleanly to the disaggregated setup (separate CTX/GEN groups with per-host URLs). With the consistency checks above, this should be robust.


72-84: Ensure num_instances always matches the URL lists
Derive the instance counts directly from the discovered hostnames to prevent any mismatch if the CLI‐provided number and the actual host list ever diverge. Please verify that downstream consumers read both num_instances and the urls list consistently.

         'context_servers': {
-            'num_instances': args.num_ctx_servers,
+            'num_instances': len(ctx_hostnames),
             'urls': [f'{host}:{args.worker_port}' for host in ctx_hostnames]
         },
         'generation_servers': {
-            'num_instances': args.num_gen_servers,
+            'num_instances': len(gen_hostnames),
             'urls': [f'{host}:{args.worker_port}' for host in gen_hostnames]
         }
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)

83-86: LGTM: container identity and log layout

The container_name change and the new per-job log directory structure look good and improve discoverability.


221-227: Server launch looks consistent now

workdir is passed (not work_dir), which matches the rest of the script, and the container reuse is coherent. No issues spotted.

examples/disaggregated/slurm/benchmark/gen_worker_config.py (2)

105-107: No action required: “MNNVL” is a valid allreduce_strategy

We’ve confirmed that downstream consumers recognize the "MNNVL" value in multiple places:

  • In the LLM API arguments literal type: tensorrt_llm/llmapi/llm_args.py includes 'MNNVL' among valid strategies.
  • In the Python enum: tensorrt_llm/functional.py defines MNNVL = 7.
  • In the model config mapping: tensorrt_llm/_torch/model_config.py maps the string "MNNVL" to AllReduceStrategy.MNNVL.
  • In the C++ kernels: cpp/tensorrt_llm/kernels/customAllReduceKernels.h lists MNNVL as an option, and related mnnvl_twoshot_allreduce kernels are registered.
  • In tests and documentation: multiple unit tests parametrize on AllReduceStrategy.MNNVL and docs under docs/source/advanced/ and release notes list "MNNVL" as a supported strategy.

Since "MNNVL" is spelled correctly and fully supported, no changes are needed here.


108-121: Confirm file-path usage is correct
The validate_load_balancer method in tensorrt_llm/llmapi/llm_args.py (around lines 2243–2252) explicitly checks for a string, treats it as a path to a YAML file, loads it with yaml.safe_load, and constructs a MoeLoadBalancerConfig instance. Downstream, moe_config.load_balancer is passed as that object into the engine config, so writing out a separate YAML and assigning its path is exactly the intended pattern. No changes required.
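
For reference, a minimal sketch of that pattern as it would appear in gen_worker_config.py (the dict nesting and YAML keys are illustrative; only the filename, num_slots, and the path-as-string hand-off come from this review):

    eplb_config_path = os.path.join(work_dir, "moe_load_balancer.yaml")
    with open(eplb_config_path, "w") as f:
        yaml.safe_dump({"num_slots": eplb_num_slots}, f)
    # validate_load_balancer() sees a string, loads the YAML with yaml.safe_load,
    # and constructs a MoeLoadBalancerConfig from it downstream.
    gen_config["moe_config"]["load_balancer"] = eplb_config_path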

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
examples/disaggregated/slurm/benchmark/README.md (6)

14-14: Minor grammar and article usage.

Add an article before “disaggregated server”.

-6.  `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine.
+6.  `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.

If you adopt the broader clarification in my other comment for Lines 10–16, this change will be superseded.


73-82: Arg list is clear; consider adding types, defaults, and constraints.

Small doc enhancement: specify accepted values (true|false), integer ranges, defaults, and whether work_dir must be shared across nodes.

Proposed tweaks:

-1.  `worker_type`: Either "CTX" or "GEN" to specify the worker type.
+1.  `worker_type` (string): "CTX" or "GEN".
-2.  `worker_index`: Index of the worker instance.
+2.  `worker_index` (int ≥ 0): Index of the worker instance.
-4.  `worker_port`: Port for the worker to listen on.
+4.  `worker_port` (int, 1–65535): Port for the worker to listen on.
-7.  `enable_pdl`: `true` or `false`.
+7.  `enable_pdl` (bool): `true` or `false`. Default: `false` (if applicable).
-8.  `work_dir`: Work directory for logs and configuration.
+8.  `work_dir` (path): Work directory for logs and configuration (must be accessible to the node running this worker).
-9.  `nsys_on`: Whether to enable nsys profiling.
+9.  `nsys_on` (bool): Whether to enable `nsys` profiling. Default: `false` (if applicable).

Please align these with actual script defaults. If helpful, I can extract them and update the text.


83-92: Server section: add where logs/config are written and any ports/URLs to expect.

A brief note about the server’s log path, the default bind address/port, and the exact config filename will help users automate health checks and log collection.

Possible augmentation:

-This script starts the `trtllm-serve disaggregated` server. It first generates the server configuration using `gen_server_config.py`, then starts the server process.
+This script starts the `trtllm-serve disaggregated` server. It first generates the server configuration using `gen_server_config.py`, then starts the server process.
+The server writes logs under `${work_dir}/server/` and binds to the configured host:port (commonly 0.0.0.0:<port> unless overridden).

If this isn’t accurate, replace with the correct defaults.


96-96: Health-check behavior: mention retry policy/timeouts.

Since this script gates benchmark execution, documenting retry interval and max wait time will prevent confusion on slow startups.

Example addition:

-This script orchestrates the execution of the benchmark client. It waits for the configuration files to be created and for the server's `/health` endpoint to respond, then it runs the benchmark.
+This script orchestrates the benchmark client. It waits for the configuration files and for the server's `/health` endpoint to respond (poll interval: <X>s, timeout: <Y>m), then runs the benchmark.

Please replace <X>/<Y> with actual values.


51-58: Improve README with concrete outputs and usage example

The README should clearly document which files are emitted, their names, and how to invoke the script with the correct flag names. Here’s a suggested update:

### `gen_worker_config.py`

-This Python script generates the worker configuration YAML file that configures the `trtllm-serve` workers. It creates separate configurations for context and generation workers with different tensor parallelism, batch sizes, and other parameters.
+This Python script generates worker configuration YAML files for `trtllm-serve` workers, splitting settings for context (CTX) and generation (GEN) phases. When using the eplb load balancer, it also outputs a separate slots configuration.

+**Outputs:**
+- `ctx_config.yaml` — CTX worker configuration (written under `<WORK_DIR>/ctx_config.yaml`)
+- `gen_config.yaml` — GEN worker configuration (written under `<WORK_DIR>/gen_config.yaml`)
+- *(Optional)* `moe_load_balancer.yaml` — load-balancer slots config when `--eplb_num_slots > 0`

 **Usage:**

 The script is called from within `disaggr_torch.slurm`, but can be run directly:

 ```bash
 python examples/disaggregated/slurm/benchmark/gen_worker_config.py \
   --work_dir <WORK_DIR> \
   --ctx_tp_size <CTX_TP> \
   --ctx_batch_size <CTX_BATCH_SIZE> \
   --ctx_max_num_tokens <CTX_MAX_TOKENS> \
   --ctx_max_seq_len <CTX_MAX_SEQ_LEN> \
   [--ctx_free_gpu_memory_fraction <FRACTION>] \
   [--ctx_enable_attention_dp] \
   --gen_tp_size <GEN_TP> \
   --gen_batch_size <GEN_BATCH_SIZE> \
   --gen_max_num_tokens <GEN_MAX_TOKENS> \
   --gen_max_seq_len <GEN_MAX_SEQ_LEN> \
   [--gen_gpu_memory_fraction <FRACTION>] \
   [--gen_enable_attention_dp] \
   [--eplb_num_slots <NUM_SLOTS>] \
   [--mtp_size <MTP_SIZE>] \
   [--cache_transceiver_max_num_tokens <NUM_TOKENS>]

On success, it prints:

ctx_config_file: <WORK_DIR>/ctx_config.yaml gen_config_file: <WORK_DIR>/gen_config.yaml generated successfully

- Correct flag names to use underscores (`--ctx_tp_size`, not `--ctx-tp`)  
- Call out all generated filenames, including `moe_load_balancer.yaml`  
- Include the script’s final print output so users know where to look for the files

---

`59-66`: **Clarify host discovery logic and output filename in `gen_server_config.py` docs**

Please update `examples/disaggregated/slurm/benchmark/README.md` (around lines 59–66) to explicitly describe how host lists are discovered and name the generated config file. For example:

```diff
-### `gen_server_config.py`
-
-This Python script generates the server configuration YAML file that configures the `trtllm-serve` disaggregated server. It reads hostname information from the work directory and creates a configuration that specifies the URLs for context and generation servers.
-
-**Usage:**
-
-The script is called from within `start_server.sh`. It takes arguments for the number of context and generation servers and the work directory.
+### `gen_server_config.py`
+
+This Python script generates the `server_config.yaml` for the `trtllm-serve disaggregated` server.
+It discovers worker hosts by reading all files in the `hostnames` subdirectory of your work directory:
+  - Waits for `HOST_DIR/hostnames` to appear under `--work_dir`  
+  - Expects exactly `<num_ctx_servers> + <num_gen_servers>` files  
+  - Files prefixed `CTX` are treated as context servers; `GEN` as generation servers  
+  - Each file must contain a single hostname (e.g., `ctx-node-01`)  
+
+Once all host files are present, it builds a YAML config with:
+```yaml
+hostname: <current-machine-hostname>
+port: <--server_port>
+backend: pytorch
+context_servers:
+  num_instances: <--num_ctx_servers>
+  urls: ["<CTX1>:<--worker_port>", ...]
+generation_servers:
+  num_instances: <--num_gen_servers>
+  urls: ["<GEN1>:<--worker_port>", ...]
+```
+and writes it to:
+
+```
+<WORK_DIR>/server_config.yaml
+```
+
+**Usage example:**
+```bash
+python examples/disaggregated/slurm/benchmark/gen_server_config.py \
+  --work_dir "${WORK_DIR}" \
+  --num_ctx_servers "${NUM_CTX}" \
+  --num_gen_servers "${NUM_GEN}" \
+  --worker_port 8336 \
+  --server_port 8333
+```
+
+This makes the host‐list requirements and output file explicit and reproducible outside SLURM.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fb674c8 and 6479716.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/README.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/disaggregated/slurm/benchmark/README.md

[grammar] ~14-~14: There might be a mistake here.
Context: ...ver.sh: A shell script responsible for starting disaggregated server using trtllm-serv...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...trtllm-serveon each allocated machine. 7. run_benchmark.sh`: A shell script that waits for the serv...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
examples/disaggregated/slurm/benchmark/README.md (3)

7-7: Good high-level framing.

The added overview sentence helps set expectations for the new multi-script orchestration. No issues.


11-13: Nice split between worker and server config generators.

The doc now clearly separates worker and server responsibilities, aligning with recommended usage. This will help users customize CTX/GEN independently.


10-16: Clarify start_server.sh invocation in README

We’ve confirmed that in disaggr_torch.slurm, start_server.sh is launched via
srun … -N 1 -n 1 bash start_server.sh…, i.e. a single server task on one node—not on every allocated machine. Please update the README to reflect this.

• File: examples/disaggregated/slurm/benchmark/README.md
• Lines: 10–16, item 6

-6.  `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine.
+6.  `start_server.sh`: A shell script that starts the disaggregated server using `trtllm-serve` on a designated server node (launched with `srun -N 1 -n 1`), after generating the server configuration.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)

62-76: Fix health-check loop: current condition never evaluates HTTP status; may report healthy prematurely

The loop negates curl’s exit code, not the HTTP status. With -w, curl prints status but still exits 0 for 200/404 alike unless --fail is used. Use --fail and test exit status, or capture and compare the code.

Minimal, robust fix:

-timeout=1800
-start_time=$(date +%s)
-while ! curl -s -o /dev/null -w "%{http_code}" http://${hostname}:${port}/health; do
+timeout=1800
+start_time=$(date +%s)
+while ! curl -sfS -o /dev/null "http://${hostname}:${port}/health"; do
     current_time=$(date +%s)
     elapsed=$((current_time - start_time))
     if [ $elapsed -ge $timeout ]; then
         echo "Error: Server is not healthy after ${timeout} seconds"
         exit 1
     fi
     if [ $((elapsed % 30)) -eq 0 ]; then
         echo "Waiting for server to be healthy... (${elapsed}s elapsed)"
     fi
     sleep 10
 done
♻️ Duplicate comments (7)
examples/disaggregated/slurm/benchmark/start_server.sh (1)

1-4: Harden server launcher: add strict mode, usage check, quoting, and verify generated YAML before serving

Same ask as earlier review: fail fast, validate args, and quote expansions. Also ensure the Python step succeeded by checking the YAML exists before starting the server.

Apply:

-#!/bin/bash
-set -u
-set -e
-set -x
+#!/usr/bin/env bash
+set -euo pipefail
+set -x
+IFS=$'\n\t'
 
-num_ctx_servers=$1
-num_gen_servers=$2
-work_dir=$3
-script_dir=$4
+if [ "$#" -lt 4 ]; then
+  echo "Usage: $0 <num_ctx_servers> <num_gen_servers> <work_dir> <script_dir>" >&2
+  exit 2
+fi
+
+num_ctx_servers="$1"
+num_gen_servers="$2"
+work_dir="$3"
+script_dir="$4"
 
-python3 ${script_dir}/gen_server_config.py \
-    --num_ctx_servers ${num_ctx_servers} \
-    --num_gen_servers ${num_gen_servers} \
-    --work_dir ${work_dir}
-echo "server config generated to ${work_dir}/server_config.yaml"
+python3 "${script_dir}/gen_server_config.py" \
+    --num_ctx_servers "${num_ctx_servers}" \
+    --num_gen_servers "${num_gen_servers}" \
+    --work_dir "${work_dir}"
+echo "server config generated to ${work_dir}/server_config.yaml"
+
+if [ ! -f "${work_dir}/server_config.yaml" ]; then
+  echo "Error: ${work_dir}/server_config.yaml not found; aborting." >&2
+  exit 1
+fi
 
-trtllm-serve disaggregated -c ${work_dir}/server_config.yaml -t 7200 -r 7200
+trtllm-serve disaggregated -c "${work_dir}/server_config.yaml" -t 7200 -r 7200

Also applies to: 6-9, 11-17

examples/disaggregated/slurm/benchmark/start_worker.sh (3)

1-5: Fix shebang and enable strict mode (pipefail + safe IFS)

Reliable error propagation with pipelines and robust word-splitting protections.

-#! /bin/bash
-set -u
-set -e
-set -x
+#!/usr/bin/env bash
+set -euo pipefail
+set -x
+IFS=$'\n\t'

31-40: Resolve config_file first and verify it exists

Fail early if the derived YAML is missing.

 if [ "${role}" = "CTX" ]; then
     config_file=${work_dir}/ctx_config.yaml
 elif [ "${role}" = "GEN" ]; then
     config_file=${work_dir}/gen_config.yaml
 else
     echo "Invalid role: ${role}"
     exit 1
 fi
 echo "config_file: ${config_file}"
+if [ ! -f "${config_file}" ]; then
+    echo "Config file not found: ${config_file}" >&2
+    exit 1
+fi

16-19: Critical: echoes unset config_file under set -u; script will abort

Referencing ${config_file} before assignment with set -u causes immediate exit. Remove this echo or move it after config resolution.

-unset UCX_TLS
-echo "config_file: ${config_file}, concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"
-echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}"
+unset UCX_TLS
+echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}, concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3)

119-121: Unify naming with start_worker.sh: prefer nsys_folder variable

start_worker.sh’s 9th arg is nsys_folder; using a different name here invites confusion. Standardize to nsys_folder and pass it through.

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys  # Uncomment to enable NSYS; GEN workers will write profiles here
@@
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
@@
-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

Also applies to: 198-199, 217-218


203-220: Mirror GEN fixes for CTX loop

Drop --segment, be cautious with --ntasks-per-node, and quote nodelist.

-    srun -l -N ${ctx_nodes_num} \
-        --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --segment=${ctx_nodes_num} \
+    srun -l -N "${ctx_nodes_num}" \
+        --ntasks="${ctx_tp_size}" \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
-        --nodelist=$(IFS=,; echo "${node_list[*]}") \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
         --mpi=pmix \
         bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
             &> ${full_logdir}/output_ctx_${i}.log &

169-178: Guard against node under-allocation and bound CTX slice to required size

Ensure your requested GEN+CTX nodes fit the allocation, and restrict ctx_nodes to exactly total_ctx_nodes_num (not the remainder).

 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
🧹 Nitpick comments (6)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (3)

80-97: Make log harvesting resilient and shell-safe

  • Use local vars to avoid global leakage.
  • Enable nullglob to avoid iterating over literal globs when no files exist.
  • Quote expansions.
-do_get_logs(){
-    log_path=$1
-    output_folder=$2
+do_get_logs(){
+    local log_path="$1"
+    local output_folder="$2"
+    shopt -s nullglob
@@
-    for gen_file in ${log_path}/output_gen_*.log; do
-        if [ -f "$gen_file" ]; then
-            index=$(basename "$gen_file" | sed 's/output_gen_\(.*\)\.log/\1/')
-            grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "$gen_file" > "${output_folder}/gen_only_${index}.txt" || true
-        fi
+    for gen_file in "${log_path}"/output_gen_*.log; do
+        if [ -f "$gen_file" ]; then
+            index="$(basename "$gen_file" | sed 's/output_gen_\(.*\)\.log/\1/')"
+            grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "$gen_file" > "${output_folder}/gen_only_${index}.txt" || true
+        fi
     done
@@
-    for ctx_file in ${log_path}/output_ctx_*.log; do
-        if [ -f "$ctx_file" ]; then
-            index=$(basename "$ctx_file" | sed 's/output_ctx_\(.*\)\.log/\1/')
-            grep -a "'num_generation_tokens': 0" "$ctx_file" > "${output_folder}/ctx_only_${index}.txt" || true
-        fi
+    for ctx_file in "${log_path}"/output_ctx_*.log; do
+        if [ -f "$ctx_file" ]; then
+            index="$(basename "$ctx_file" | sed 's/output_ctx_\(.*\)\.log/\1/')"
+            grep -a "'num_generation_tokens': 0" "$ctx_file" > "${output_folder}/ctx_only_${index}.txt" || true
+        fi
     done
 }
@@
-    do_get_logs ${log_path} ${log_path}/concurrency_${concurrency}
+    do_get_logs "${log_path}" "${log_path}/concurrency_${concurrency}"

Also applies to: 121-123


1-6: Consider pipefail and safer trap for better diagnostics

With pipelines and tee, set -o pipefail ensures failures propagate. The trap is good—retain it.

-set -e
+set -e
+set -o pipefail

55-61: Minor hardening: quote URLs/paths and prefer graceful shutdown before SIGKILL

  • Quote wget destination and URL.
  • When shutting down, try TERM first, then KILL after a grace period.
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -O ${shared_gpt_path}
+wget "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json" -O "${shared_gpt_path}"
@@
-echo "Starting benchmark..."
+echo "Starting benchmark..."
@@
-    mkdir -p ${log_path}/concurrency_${concurrency}
+    mkdir -p "${log_path}/concurrency_${concurrency}"
@@
-echo "Benchmark done, gracefully shutting down server and workers..."
-kill -9 $(ps aux | grep '[s]tart_server.sh' | awk '{print $2}') >/dev/null 2>&1 || true
-kill -9 $(ps aux | grep '[s]tart_worker.sh' | awk '{print $2}') >/dev/null 2>&1 || true
-kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
-sleep 20  # Give processes some time to clean up
+echo "Benchmark done, gracefully shutting down server and workers..."
+pkill -f '[s]tart_server.sh' || true
+pkill -f '[s]tart_worker.sh' || true
+pkill -f '[t]rtllm-serve' || true
+sleep 10
+# Force kill stragglers
+pkill -9 -f '[s]tart_server.sh' || true
+pkill -9 -f '[s]tart_worker.sh' || true
+pkill -9 -f '[t]rtllm-serve' || true
+sleep 5

Also applies to: 100-121, 125-136

examples/disaggregated/slurm/benchmark/start_worker.sh (1)

43-48: Quote hostname redirection and mkdir path

Prevents word splitting and unintended globbing.

-if [ "${SLURM_NODEID}" = "0" ]; then
-    mkdir -p ${work_dir}/hostnames/
-    echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt
-    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
+if [ "${SLURM_NODEID}" = "0" ]; then
+    mkdir -p "${work_dir}/hostnames/"
+    echo "$(hostname)" > "${work_dir}/hostnames/${role}_${instance_id}.txt"
+    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
 fi
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)

87-106: Log path label typo: dep vs tp/tep inconsistencies

In full_logdir formatting, one branch uses dep${gen_tp_size}, the other uses tep${gen_tp_size}. Likely a typo; choose a single label (e.g., tp) for consistency and discoverability.

-full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_dep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
+full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tp${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
@@
-    full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
+    full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tp${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}

181-184: Quote expansions and paths across srun blocks

General shell safety: quote nodelist construction, directories, and script paths.

-rm -rf ${full_logdir}/hostnames
+rm -rf "${full_logdir}/hostnames"
@@
-    echo "gen node_list index ${i}: ${node_list[@]}"
+    echo "gen node_list index ${i}: ${node_list[@]}"
@@
-        --container-mounts=${mounts} \
+        --container-mounts="${mounts}" \
@@
-    echo "ctx node_list index ${i}: ${node_list[@]}"
+    echo "ctx node_list index ${i}: ${node_list[@]}"
@@
-echo "pid_list: ${pid_list}"
+echo "pid_list: ${pid_list}"
@@
-    --container-image=${container_image} \
-    --container-mounts=${mounts} \
+    --container-image="${container_image}" \
+    --container-mounts="${mounts}" \
@@
-    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${workdir} \
+    bash "${workdir}/start_server.sh" "${num_ctx_servers}" "${num_gen_servers}" "${full_logdir}" "${workdir}" \
@@
-    --container-mounts=${mounts} \
+    --container-mounts="${mounts}" \
@@
-    bash ${workdir}/run_benchmark.sh ${isl} ${osl} ${multi_round} ${model_dir} "${concurrency}" ${streaming} ${full_logdir} \
+    bash "${workdir}/run_benchmark.sh" "${isl}" "${osl}" "${multi_round}" "${model_dir}" "${concurrency}" "${streaming}" "${full_logdir}" \
@@
-        kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
+        kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true

Also applies to: 189-197, 208-216, 222-231, 236-237, 239-244

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6479716 and 07b369b.

📒 Files selected for processing (4)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
  • examples/disaggregated/slurm/benchmark/run_benchmark.sh (4 hunks)
  • examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
  • examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh

[warning] 46-46: Quote this to prevent word splitting.

(SC2046)


[warning] 53-53: Quote this to prevent word splitting.

(SC2046)


[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.

(SC2089)


[warning] 66-66: Quotes/backslashes in this variable will not be respected.

(SC2090)


[warning] 68-68: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)

28-48: Quote config_file and anchor YAML key extraction in run_benchmark.sh

The server_config.yaml produced by gen_server_config.py has flat, top-level hostname: and port: keys (followed by nested sections for context_servers and generation_servers that do not define hostname: or port: keys). We can safely extract those two values by:

  • Quoting ${config_file} everywhere to guard against spaces in $log_path.
  • Using an anchored awk pattern to match only top-level keys and stop after the first hit.

Apply the following diff to examples/disaggregated/slurm/benchmark/run_benchmark.sh (lines 28–54):

-config_file=${log_path}/server_config.yaml
+config_file="${log_path}/server_config.yaml"
 
 # check if the config file exists every 10 seconds timeout 1800 seconds
 timeout=1800
 start_time=$(date +%s)
-while [ ! -f ${config_file} ]; do
+while [ ! -f "${config_file}" ]; do
     current_time=$(date +%s)
     elapsed=$((current_time - start_time))
@@
-# grep the host and port from the config file
-hostname=$(grep -i "hostname:" ${config_file} | awk '{print $2}')
-port=$(grep -i "port:" ${config_file} | awk '{print $2}')
+# extract only top-level hostname and port, then exit after the first match
+hostname=$(awk '/^[[:space:]]*hostname:[[:space:]]*/ {print $2; exit}' "${config_file}")
+port=$(awk '/^[[:space:]]*port:[[:space:]]*/ {print $2; exit}' "${config_file}")

With this change:

  • Spaces in $log_path won’t break the script.
  • Only the first occurrence of each key at the start of a line (allowing leading whitespace) is captured, avoiding any accidental matches elsewhere.

No further changes are needed—anchored awk suffices given the current YAML shape.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (8)
examples/disaggregated/slurm/benchmark/start_worker.sh (4)

1-4: Fix shebang and enable full strict mode (pipefail + safe IFS).

Use a portable shebang and fail fast on pipeline errors; keeps debug tracing. This was raised earlier and still applies.

-#! /bin/bash
-set -u
-set -e
-set -x
+#!/usr/bin/env bash
+set -euo pipefail
+set -x
+IFS=$'\n\t'

6-14: Add argument-count validation with a clear usage message before reading $1..$9.

With set -u, referencing missing positional args will exit with “unbound variable.” Guard early with a concise usage.

+if [ "$#" -lt 9 ]; then
+  echo "Usage: $0 <role:CTX|GEN> <instance_id> <model_path> <port> <benchmark_mode:e2e|gen_only> <concurrency> <enable_pdl:true|false> <work_dir> <nsys_folder|empty>" >&2
+  exit 2
+fi
+
 role=$1
 instance_id=$2
 model_path=$3
 port=$4
 benchmark_mode=$5
 concurrency=$6
 enable_pdl=$7
 work_dir=$8
 nsys_folder=${9:-}

31-40: Validate config_file existence immediately after deriving it.

Failing early prevents a confusing downstream launch error if the YAML wasn’t generated.

 if [ "${role}" = "CTX" ]; then
     config_file=${work_dir}/ctx_config.yaml
 elif [ "${role}" = "GEN" ]; then
     config_file=${work_dir}/gen_config.yaml
 else
     echo "Invalid role: ${role}"
     exit 1
 fi
-echo "config_file: ${config_file}"
+echo "config_file: ${config_file}"
+if [ ! -f "${config_file}" ]; then
+    echo "Config file not found: ${config_file}" >&2
+    exit 1
+fi

51-69: Build NSYS prefix as an array; quote all args; fix SC2089/SC2090/SC2046.

String-building a command breaks quoting; use an array and expand with "${arr[@]}". Also quote host/port/model/config paths. This was flagged previously and remains.

-if [ -z "${nsys_folder:-}" ]; then
-    echo "nsys is not enabled, start normal flow"
-    trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file}
+if [ -z "${nsys_folder:-}" ]; then
+    echo "nsys is not enabled, start normal flow"
+    trtllm-llmapi-launch trtllm-serve "${model_path}" --host "$(hostname)" --port "${port}" --extra_llm_api_options "${config_file}"
 else
-    nsys_prefix=""
-    nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}
+    nsys_args=()
+    nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}"
     export TLLM_PROFILE_RECORD_GC=1
     export TLLM_NVTX_DEBUG=1
     if [ "${role}" = "GEN" ]; then
         export TLLM_PROFILE_START_STOP=200-250
-        nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none"
-        echo "nsys_prefix: ${nsys_prefix}"
+        nsys_args=(nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none)
+        echo "nsys_args: ${nsys_args[*]}"
     elif [ "${role}" = "CTX" ]; then
         echo "nsys is not enabled on ctx_gpus"
     fi
-    trtllm-llmapi-launch ${nsys_prefix} \
-        trtllm-serve ${model_path} \
-            --host $(hostname) --port ${port} \
-            --extra_llm_api_options ${config_file}
+    trtllm-llmapi-launch "${nsys_args[@]}" \
+        trtllm-serve "${model_path}" \
+            --host "$(hostname)" --port "${port}" \
+            --extra_llm_api_options "${config_file}"
 fi
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

119-121: Standardize on nsys_folder to match worker arg name.

Avoids variable-name drift between orchestrator and worker scripts. This was noted previously.

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; GEN workers will write profiles here

-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

Also applies to: 198-199, 217-218


162-178: Bound ctx_nodes slice and add capacity check against allocated nodes.

Prevents out-of-range slices and misassignment when allocation is insufficient. Previously suggested; still needed.

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
 
 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"

189-197: srun: remove invalid --segment and avoid overspecifying tasks-per-node; quote nodelist.

--segment is not an srun option and will fail. Forcing --ntasks-per-node can overconstrain placement. Quote the nodelist so it’s one arg. This mirrors prior feedback.

-    srun -l -N ${gen_nodes_num} \
-        --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --segment=${gen_nodes_num} \
+    srun -l -N "${gen_nodes_num}" \
+        --ntasks="${gen_tp_size}" \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
-        --nodelist=$(IFS=,; echo "${node_list[*]}") \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
         --mpi=pmix \
         bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
             &> ${full_logdir}/output_gen_${i}.log &

Optionally, only set per-node tasks when divisible (place before the loop):

# compute optional arg
if (( gen_tp_size % gen_nodes_num == 0 )); then gen_tasks_per_node=$(( gen_tp_size / gen_nodes_num )); gen_ntasks_per_node_arg=(--ntasks-per-node="${gen_tasks_per_node}"); else gen_ntasks_per_node_arg=(); fi
# ... and add: "${gen_ntasks_per_node_arg[@]}"

208-216: Mirror fix for CTX loop: remove --segment, be cautious with --ntasks-per-node, quote nodelist.

Same runtime failure risk as GEN loop.

-    srun -l -N ${ctx_nodes_num} \
-        --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --segment=${ctx_nodes_num} \
+    srun -l -N "${ctx_nodes_num}" \
+        --ntasks="${ctx_tp_size}" \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
-        --nodelist=$(IFS=,; echo "${node_list[*]}") \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
         --mpi=pmix \
         bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
             &> ${full_logdir}/output_ctx_${i}.log &

Optional divisible-only arg (before the loop):

if (( ctx_tp_size % ctx_nodes_num == 0 )); then
    ctx_tasks_per_node=$(( ctx_tp_size / ctx_nodes_num ))
    ctx_ntasks_per_node_arg=(--ntasks-per-node="${ctx_tasks_per_node}")
else
    ctx_ntasks_per_node_arg=()
fi
# add: "${ctx_ntasks_per_node_arg[@]}"
🧹 Nitpick comments (5)
examples/disaggregated/slurm/benchmark/start_worker.sh (1)

17-19: Quote all expansions and prefer printf for robust logging/IO.

Addresses Shellcheck SC2046 and avoids surprises if paths or hostnames contain special chars.

-echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"
-echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}"
+echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"
+echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}"

-if [ "${SLURM_NODEID}" = "0" ]; then
-    mkdir -p ${work_dir}/hostnames/
-    echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt
-    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
+if [ "${SLURM_NODEID}" = "0" ]; then
+    mkdir -p "${work_dir}/hostnames/"
+    printf '%s\n' "$(hostname)" > "${work_dir}/hostnames/${role}_${instance_id}.txt"
+    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
 fi

Also applies to: 45-48

examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

10-13: Enable full strict mode (pipefail + safe IFS).

Improves failure detection in multi-stage pipelines; keeps xtrace for debugging.

-set -u
-set -e
-set -x
+set -euo pipefail
+set -x
+IFS=$'\n\t'

114-117: Quote trtllm_repo and optionally gate the git call.

Avoids edge cases when trtllm_repo is empty or contains spaces; keeps stderr suppressed as you already do.

-if [ -z "${TRT_LLM_GIT_COMMIT:-}" ]; then
-    export TRT_LLM_GIT_COMMIT=$(git -C ${trtllm_repo} rev-parse --short HEAD 2>/dev/null || echo "unknown")
+if [ -z "${TRT_LLM_GIT_COMMIT:-}" ]; then
+    export TRT_LLM_GIT_COMMIT="$(git -C "${trtllm_repo}" rev-parse --short HEAD 2>/dev/null || echo "unknown")"
     echo "TRT_LLM_GIT_COMMIT: ${TRT_LLM_GIT_COMMIT}"
 fi

136-158: Quote Python CLI args when generating YAML.

Safer if any values contain special chars; consistent with other quoted usages.

-srun -l -N 1 -n 1 \
+srun -l -N 1 -n 1 \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
         --mpi=pmix --overlap \
-        python3 ${workdir}/gen_worker_config.py \
-                --work_dir ${full_logdir} \
-                --ctx_tp_size ${ctx_tp_size} \
-                --ctx_batch_size ${ctx_batch_size} \
-                --ctx_max_num_tokens ${ctx_max_num_tokens} \
-                --ctx_max_seq_len ${ctx_max_seq_len} \
-                --ctx_free_gpu_memory_fraction ${ctx_gpu_frac} \
-                --gen_tp_size ${gen_tp_size} \
-                --gen_batch_size ${gen_batch_size} \
-                --gen_max_num_tokens ${gen_max_num_tokens} \
-                --gen_max_seq_len ${gen_max_seq_len} \
-                --gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \
-                --eplb_num_slots ${eplb_num_slots} \
-                --mtp_size ${mtp_size} \
-                --cache_transceiver_max_num_tokens ${cache_transceiver_max_num_tokens} \
+        python3 "${workdir}/gen_worker_config.py" \
+                --work_dir "${full_logdir}" \
+                --ctx_tp_size "${ctx_tp_size}" \
+                --ctx_batch_size "${ctx_batch_size}" \
+                --ctx_max_num_tokens "${ctx_max_num_tokens}" \
+                --ctx_max_seq_len "${ctx_max_seq_len}" \
+                --ctx_free_gpu_memory_fraction "${ctx_gpu_frac}" \
+                --gen_tp_size "${gen_tp_size}" \
+                --gen_batch_size "${gen_batch_size}" \
+                --gen_max_num_tokens "${gen_max_num_tokens}" \
+                --gen_max_seq_len "${gen_max_seq_len}" \
+                --gen_gpu_memory_fraction "${gen_gpu_memory_fraction}" \
+                --eplb_num_slots "${eplb_num_slots}" \
+                --mtp_size "${mtp_size}" \
+                --cache_transceiver_max_num_tokens "${cache_transceiver_max_num_tokens}" \
                 $(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \
                 $(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \
                 2>&1 | tee ${full_logdir}/gen_worker_config.log

232-237: Quote all benchmark script args for consistency.

Minor robustness nit; you already quote concurrency.

-    bash ${workdir}/run_benchmark.sh ${isl} ${osl} ${multi_round} ${model_dir} "${concurrency}" ${streaming} ${full_logdir} \
+    bash "${workdir}/run_benchmark.sh" "${isl}" "${osl}" "${multi_round}" "${model_dir}" "${concurrency}" "${streaming}" "${full_logdir}" \
         &> ${full_logdir}/benchmark.log 2>&1
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 07b369b and 7b9a11c.

📒 Files selected for processing (2)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
  • examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh

[warning] 46-46: Quote this to prevent word splitting.

(SC2046)


[warning] 53-53: Quote this to prevent word splitting.

(SC2046)


[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.

(SC2089)


[warning] 66-66: Quotes/backslashes in this variable will not be respected.

(SC2090)


[warning] 68-68: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@kaiyux kaiyux requested a review from dongxuy04 August 24, 2025 01:26
@kaiyux kaiyux force-pushed the user/kaiyu/compat_disagg_usage branch from 7b9a11c to 85f2bb8 on August 25, 2025 00:36
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (12)
examples/disaggregated/slurm/benchmark/gen_server_config.py (3)

1-7: Add NVIDIA 2025 header and switch to structured logging

The file is missing the required NVIDIA copyright header and uses print statements. Prefer the logging module for controllable verbosity and cleaner SLURM logs.

Apply:

+#!/usr/bin/env python3
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import os
 import socket
 import time
 
+import logging
+from typing import List
 import yaml

8-30: Add wait timeout and log-level flags; initialize logging early

Blocking indefinitely is brittle in SLURM runs. Expose a timeout and configurable log level, then initialize logging before any waits.

 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument("--num_ctx_servers",
                         type=int,
                         required=True,
                         help="Number of context servers")
     parser.add_argument("--num_gen_servers",
                         type=int,
                         required=True,
                         help="Number of generation servers")
     parser.add_argument("--work_dir",
                         type=str,
                         default="logs",
                         help="Work directory")
     parser.add_argument("--worker_port",
                         type=int,
                         default=8336,
                         help="Worker port")
     parser.add_argument("--server_port",
                         type=int,
                         default=8333,
                         help="Server port")
+    parser.add_argument("--wait_timeout_sec",
+                        type=int,
+                        default=600,
+                        help="Max seconds to wait for hostname files before failing")
+    parser.add_argument("--log_level",
+                        type=str,
+                        default="INFO",
+                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+                        help="Logging level")
     args = parser.parse_args()
+
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(message)s",
+    )
+    logger = logging.getLogger(__name__)

36-49: Avoid infinite wait; check CTX/GEN counts separately; dedup and validate; fix E501 by logging

The current polling can block forever and relies on a combined count. Filter CTX/GEN files, enforce a deadline, deduplicate hostnames, and replace long prints with concise logs.

-    #check all of the hostnames in the hostnames folder exists, if not, sleep 10 seconds and check again
-    hostnames_folder = os.path.join(args.work_dir, "hostnames")
-    while not os.path.exists(hostnames_folder):
-        time.sleep(10)
-        print(f"Waiting for hostnames folder {hostnames_folder} to be found")
-    hostnames = os.listdir(hostnames_folder)
-    # check length of hostnames is equal to num_ctx_servers + num_gen_servers, if not, sleep 10 seconds and check again
-    while len(hostnames) != args.num_ctx_servers + args.num_gen_servers:
-        time.sleep(10)
-        hostnames = os.listdir(hostnames_folder)
-        print(
-            f"Waiting for hostnames to be found in {hostnames_folder}, current length: {len(hostnames)}, expected length: {args.num_ctx_servers + args.num_gen_servers}"
-        )
-    print(f"All hostnames found in {hostnames_folder}")
+    # Wait for hostnames/ to appear and contain the expected CTX/GEN files
+    hostnames_folder = os.path.join(args.work_dir, "hostnames")
+    deadline = time.monotonic() + args.wait_timeout_sec
+    while True:
+        if os.path.exists(hostnames_folder):
+            all_files = os.listdir(hostnames_folder)
+            ctx_files = sorted(f for f in all_files if f.startswith("CTX"))
+            gen_files = sorted(f for f in all_files if f.startswith("GEN"))
+            if (len(ctx_files) == args.num_ctx_servers and
+                    len(gen_files) == args.num_gen_servers):
+                break
+            logger.info("Waiting for hostnames in %s: CTX=%d/%d, GEN=%d/%d",
+                        hostnames_folder, len(ctx_files), args.num_ctx_servers,
+                        len(gen_files), args.num_gen_servers)
+        else:
+            logger.info("Waiting for hostnames folder %s to be created ...",
+                        hostnames_folder)
+        if time.monotonic() >= deadline:
+            raise TimeoutError(
+                f"Timed out after {args.wait_timeout_sec}s waiting for "
+                f"hostname files in {hostnames_folder}"
+            )
+        time.sleep(10)
+    logger.info("All required hostname files found in %s", hostnames_folder)
 
-    # get the ctx and gen hostnames from the hostnames file
-    ctx_hostnames = []
-    gen_hostnames = []
-    for hostname_file in hostnames:
-        hostname_file_path = os.path.join(hostnames_folder, hostname_file)
-        with open(hostname_file_path, 'r') as f:
-            actual_hostname = f.read().strip()
-            print(f"Hostname: {actual_hostname} in {hostname_file}")
-
-        if hostname_file.startswith("CTX"):
-            ctx_hostnames.append(actual_hostname)
-        elif hostname_file.startswith("GEN"):
-            gen_hostnames.append(actual_hostname)
+    # Read actual hostnames from CTX/GEN files
+    ctx_hostnames: List[str] = []
+    for hostname_file in ctx_files:
+        hostname_file_path = os.path.join(hostnames_folder, hostname_file)
+        with open(hostname_file_path, "r", encoding="utf-8") as f:
+            actual_hostname = f.read().strip()
+        logger.debug("CTX host: %s (%s)", actual_hostname, hostname_file)
+        ctx_hostnames.append(actual_hostname)
+
+    gen_hostnames: List[str] = []
+    for hostname_file in gen_files:
+        hostname_file_path = os.path.join(hostnames_folder, hostname_file)
+        with open(hostname_file_path, "r", encoding="utf-8") as f:
+            actual_hostname = f.read().strip()
+        logger.debug("GEN host: %s (%s)", actual_hostname, hostname_file)
+        gen_hostnames.append(actual_hostname)
+
+    # Deduplicate while preserving order
+    ctx_hostnames = list(dict.fromkeys(ctx_hostnames))
+    gen_hostnames = list(dict.fromkeys(gen_hostnames))
+
+    if (len(ctx_hostnames) != args.num_ctx_servers or
+            len(gen_hostnames) != args.num_gen_servers):
+        raise RuntimeError(
+            "Mismatch between expected and discovered hosts: "
+            f"expected CTX={args.num_ctx_servers}, GEN={args.num_gen_servers}; "
+            f"found CTX={len(ctx_hostnames)}, GEN={len(gen_hostnames)}"
+        )
 
-    print(f"ctx_hostnames: {ctx_hostnames}")
-    print(f"gen_hostnames: {gen_hostnames}")
+    logger.info("ctx_hostnames: %s", ctx_hostnames)
+    logger.info("gen_hostnames: %s", gen_hostnames)

Also applies to: 51-66

examples/disaggregated/slurm/benchmark/gen_worker_config.py (4)

1-4: Add NVIDIA 2025 header

Mandatory per repo guidelines.

+#!/usr/bin/env python3
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 import argparse
 import os

21-22: Align default cache_transceiver_max_num_tokens with CLI (8448)

Signature default 4608 disagrees with the CLI default 8448; this can surprise direct callers.

-                    cache_transceiver_max_num_tokens: int = 4608) -> None:
+                    cache_transceiver_max_num_tokens: int = 8448) -> None:

23-45: Replace stale/inaccurate docstring with a correct Google-style docstring

The current docstring lists non-existent args and omits real ones.

-    """
-    Generate configuration YAML file for disaggregated inference.
-
-    Args:
-        config_path: Path to save the config file
-        model_path: Path to the model
-        num_ctx_servers: Number of context servers
-        ctx_tp_size: Tensor parallel size for context servers
-        ctx_batch_size: Batch size for context servers
-        ctx_max_num_tokens: Max number of tokens for context servers
-        ctx_max_seq_len: Max sequence length for context servers
-        ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers
-        ctx_enable_attention_dp: Enable attention DP for context servers
-        num_gen_servers: Number of generation servers
-        gen_tp_size: Tensor parallel size for generation servers
-        gen_batch_size: Batch size for generation servers
-        gen_max_num_tokens: Max number of tokens for generation servers
-        gen_enable_attention_dp: Enable attention DP for generation servers
-        gen_gpu_memory_fraction: GPU memory fraction for generation servers
-        eplb_num_slots: Number of slots for eplb
-        worker_start_port: Start port for workers
-        server_port: Server port
-    """
+    """
+    Generate per-role YAML configs for disaggregated workers.
+
+    Writes two files into work_dir:
+      - ctx_config.yaml: context workers
+      - gen_config.yaml: generation workers
+
+    Args:
+        work_dir: Directory to write the YAML files.
+        ctx_tp_size: Tensor-parallel size for context workers.
+        ctx_batch_size: Max batch size for context workers.
+        ctx_max_num_tokens: Max tokens per batch for context workers.
+        ctx_max_seq_len: Max sequence length for context workers.
+        ctx_free_gpu_memory_fraction: Fraction of GPU memory kept free on ctx.
+        ctx_enable_attention_dp: Enable attention data parallel on ctx.
+        gen_tp_size: Tensor-parallel size for generation workers.
+        gen_batch_size: Max batch size for generation workers.
+        gen_max_num_tokens: Max tokens per batch for generation workers.
+        gen_max_seq_len: Max sequence length for generation workers.
+        gen_enable_attention_dp: Enable attention data parallel on gen.
+        gen_gpu_memory_fraction: Fraction of GPU memory kept free on gen.
+        eplb_num_slots: MoE expert-parallel load balancer slots (0 disables).
+        mtp_size: If > 0, enable MTP speculative decoding with this many layers.
+        cache_transceiver_max_num_tokens: Max tokens buffered by transceiver.
+
+    Returns:
+        None
+    """

132-137: Create work_dir before writing; avoid shadowing function name

Writing YAMLs fails if work_dir is missing. Also, local variable gen_config_file shadows the function name; rename for clarity.

-    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
-    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
-    with open(ctx_config_file, "w") as f:
+    os.makedirs(work_dir, exist_ok=True)
+    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
+    gen_config_path = os.path.join(work_dir, "gen_config.yaml")
+    with open(ctx_config_file, "w", encoding="utf-8") as f:
         yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False)
-    with open(gen_config_file, "w") as f:
+    with open(gen_config_path, "w", encoding="utf-8") as f:
         yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False)
 
-    print(
-        f"ctx_config_file: {ctx_config_file} gen_config_file: {gen_config_file} generated successfully"
-    )
+    print(f"ctx_config_file: {ctx_config_file} "
+          f"gen_config_file: {gen_config_path} generated successfully")

Also applies to: 139-141

examples/disaggregated/slurm/benchmark/README.md (1)

114-117: Startup order: start the server before workers (or document worker retry)

Current steps launch workers before the server, but workers do not implement retry/wait for server readiness. Safer to start the server first.

-5.  `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files.
-6.  `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers for both context and generation phases.
-7.  `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration using `gen_server_config.py`.
+5.  `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files.
+6.  `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration via `gen_server_config.py`.
+7.  `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes. Workers will wait/retry until the server is reachable. (If no retry is implemented, keep this step after server startup.)
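
If a readiness wait is added, a minimal sketch is shown below; the /health endpoint, host, port, and timeout values are assumptions for illustration, not verified behavior of trtllm-serve:

# Poll the server until it responds, or give up after a timeout (values are placeholders).
server_host=localhost
server_port=8333
timeout_sec=1800
start_ts=$(date +%s)
until curl -sf "http://${server_host}:${server_port}/health" > /dev/null; do
    if (( $(date +%s) - start_ts > timeout_sec )); then
        echo "Server did not become ready within ${timeout_sec}s" >&2
        exit 1
    fi
    echo "Waiting for server at ${server_host}:${server_port} ..."
    sleep 10
done
echo "Server is ready."
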
examples/disaggregated/slurm/benchmark/start_worker.sh (4)

1-5: Fix shebang and enable strict mode with pipefail + safe IFS

Improve portability and early failure behavior.

-#! /bin/bash
-set -u
-set -e
-set -x
+#!/usr/bin/env bash
+set -euo pipefail
+set -x
+IFS=$'\n\t'

6-15: Validate argument count and provide usage

Prevents undefined behavior when the required arguments are missing (the ninth, nsys_folder, is optional).

 role=$1
 instance_id=$2
 model_path=$3
 port=$4
 benchmark_mode=$5
 concurrency=$6
 enable_pdl=$7
 work_dir=$8
 nsys_folder=${9:-}
 
+if [ "$#" -lt 9 ]; then
+  echo "Usage: $0 <role:CTX|GEN> <instance_id> <model_path> <port> <benchmark_mode:e2e|gen_only> <concurrency> <enable_pdl:true|false> <work_dir> <nsys_folder|empty>" >&2
+  exit 2
+fi

31-40: Check config existence after selection

Fail fast with a clear message if the resolved YAML is missing.

 if [ "${role}" = "CTX" ]; then
     config_file=${work_dir}/ctx_config.yaml
 elif [ "${role}" = "GEN" ]; then
     config_file=${work_dir}/gen_config.yaml
 else
     echo "Invalid role: ${role}"
     exit 1
 fi
-echo "config_file: ${config_file}"
+echo "config_file: ${config_file}"
+if [ ! -f "${config_file}" ]; then
+    echo "Config file not found: ${config_file}" >&2
+    exit 1
+fi

51-69: Use arrays for NSYS prefix; quote all args; fix SC2089/SC2090/SC2046

Compose NSYS arguments as an array and quote all parameter expansions.

-if [ -z "${nsys_folder:-}" ]; then
-    echo "nsys is not enabled, start normal flow"
-    trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file}
+if [ -z "${nsys_folder:-}" ]; then
+    echo "nsys is not enabled, start normal flow"
+    trtllm-llmapi-launch trtllm-serve "${model_path}" \
+        --host "$(hostname)" --port "${port}" \
+        --extra_llm_api_options "${config_file}"
 else
-    nsys_prefix=""
-    nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}
+    nsys_args=()
+    nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}"
     export TLLM_PROFILE_RECORD_GC=1
     export TLLM_NVTX_DEBUG=1
     if [ "${role}" = "GEN" ]; then
         export TLLM_PROFILE_START_STOP=200-250
-        nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none"
-        echo "nsys_prefix: ${nsys_prefix}"
+        nsys_args=(nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none)
+        echo "nsys_args: ${nsys_args[*]}"
     elif [ "${role}" = "CTX" ]; then
         echo "nsys is not enabled on ctx_gpus"
     fi
-    trtllm-llmapi-launch ${nsys_prefix} \
-        trtllm-serve ${model_path} \
-            --host $(hostname) --port ${port} \
-            --extra_llm_api_options ${config_file}
+    trtllm-llmapi-launch "${nsys_args[@]}" \
+        trtllm-serve "${model_path}" \
+            --host "$(hostname)" --port "${port}" \
+            --extra_llm_api_options "${config_file}"
 fi
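
To see why an array is needed here (the essence of SC2089/SC2090), compare holding a command in a plain string versus an array; this is a generic illustration, not code from the script:

# Quotes embedded in a string variable are passed literally after word splitting:
cmd='printf "%s\n" "a b"'
$cmd              # printf receives '"%s\n"', '"a', 'b"' — mangled arguments
# An array preserves each argument exactly as written:
args=(printf "%s\n" "a b")
"${args[@]}"      # printf receives '%s\n' and 'a b'; prints: a b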
🧹 Nitpick comments (7)
examples/disaggregated/slurm/benchmark/gen_server_config.py (3)

68-75: Prefer logging for local hostname; minor nit

Use logging instead of print for consistency.

-    # get current hostname from env
-    hostname = socket.gethostname()
-    print(f"Current hostname: {hostname}")
+    # get current hostname
+    hostname = socket.gethostname()
+    logger.info("Current hostname: %s", hostname)

86-90: Stabilize YAML output and final message

Explicit YAML options yield deterministic output; switch final print to logging.

-    with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f:
-        yaml.dump(server_config, f)
-    print(
-        f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated"
-    )
+    out_path = os.path.join(args.work_dir, "server_config.yaml")
+    with open(out_path, "w", encoding="utf-8") as f:
+        yaml.dump(server_config, f, default_flow_style=False, sort_keys=False)
+    logger.info("Server config generated: %s", out_path)

22-29: Validate port ranges (0 < port < 65536)

Defensive check to prevent misconfigurations that would cause bind/connect failures.

     parser.add_argument("--worker_port",
                         type=int,
                         default=8336,
                         help="Worker port")
     parser.add_argument("--server_port",
                         type=int,
                         default=8333,
                         help="Server port")
     args = parser.parse_args()
+
+    for p_name in ("worker_port", "server_port"):
+        p_val = getattr(args, p_name)
+        if not (0 < p_val < 65536):
+            raise ValueError(f"{p_name} must be in (0, 65535), got {p_val}")
examples/disaggregated/slurm/benchmark/gen_worker_config.py (2)

67-69: Ensure CUDA graph batch sizes are unique and sorted

Avoid duplicates and keep the list monotonically increasing.

-    gen_cuda_graph_batch_sizes = [
-        1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size
-    ]
+    gen_cuda_graph_batch_sizes = sorted({
+        1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size
+    })

1-6: Optional: switch to logging and expose --log_level

For parity with gen_server_config.py and easier debugging in SLURM.

 import argparse
 import os
 
+import logging
 import yaml
 
@@
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
 
+    parser.add_argument("--log_level",
+                        type=str,
+                        default="INFO",
+                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+                        help="Logging level")
@@
     args = parser.parse_args()
 
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(message)s",
+    )
+    logger = logging.getLogger(__name__)
+
     gen_config_file(args.work_dir, args.ctx_tp_size, args.ctx_batch_size,
                     args.ctx_max_num_tokens, args.ctx_max_seq_len,
                     args.ctx_free_gpu_memory_fraction,
                     args.ctx_enable_attention_dp, args.gen_tp_size,
                     args.gen_batch_size, args.gen_max_num_tokens,
                     args.gen_max_seq_len, args.gen_enable_attention_dp,
                     args.gen_gpu_memory_fraction, args.eplb_num_slots,
                     args.mtp_size, args.cache_transceiver_max_num_tokens)

Also applies to: 145-151, 212-221

examples/disaggregated/slurm/benchmark/README.md (1)

14-14: Grammar nit: add article

“starting disaggregated server” → “starting the disaggregated server”.

-6.  `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine.
+6.  `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.
examples/disaggregated/slurm/benchmark/start_worker.sh (1)

16-19: Quote variable expansions in echoes and commands

Minor, but prevents word splitting if values contain spaces.

-echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"
-echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}"
+echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"
+echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}"

(Note: keep as-is if your style excludes quoting inside echo for simple status lines.)

Also applies to: 53-53, 68-68

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b9a11c and 85f2bb8.

📒 Files selected for processing (8)
  • examples/disaggregated/slurm/benchmark/README.md (3 hunks)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
  • examples/disaggregated/slurm/benchmark/gen_server_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/gen_yaml.py (0 hunks)
  • examples/disaggregated/slurm/benchmark/run_benchmark.sh (4 hunks)
  • examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
  • examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
💤 Files with no reviewable changes (1)
  • examples/disaggregated/slurm/benchmark/gen_yaml.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/disaggregated/slurm/benchmark/start_server.sh
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
  • examples/disaggregated/slurm/benchmark/run_benchmark.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
  • examples/disaggregated/slurm/benchmark/gen_server_config.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
  • examples/disaggregated/slurm/benchmark/gen_server_config.py
🧠 Learnings (1)
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files

Applied to files:

  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
🪛 LanguageTool
examples/disaggregated/slurm/benchmark/README.md

[grammar] ~14-~14: There might be a mistake here.
Context: ...ver.sh: A shell script responsible for starting disaggregated server using trtllm-serv...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...trtllm-serveon each allocated machine. 7. run_benchmark.sh`: A shell script that waits for the serv...

(QB_NEW_EN)

🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh

[warning] 46-46: Quote this to prevent word splitting.

(SC2046)


[warning] 53-53: Quote this to prevent word splitting.

(SC2046)


[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.

(SC2089)


[warning] 66-66: Quotes/backslashes in this variable will not be respected.

(SC2090)


[warning] 68-68: Quote this to prevent word splitting.

(SC2046)

🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/gen_server_config.py

47-47: Line too long (168 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

119-121: Use a consistent “nsys_folder” arg name and semantics across script and workers

Earlier reviews noted workers expect a folder path parameter (nsys_folder). This script still defines/passes nsys_on. Standardize to nsys_folder and pass it through.

Apply:

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; workers will write profiles here
@@
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
@@
-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

Verification script to confirm start_worker.sh’s expected last-arg name:

#!/bin/bash
# Inspect worker launcher args to confirm the profiling arg name/semantics
fd -a start_worker.sh
rg -nP 'add_argument|^\s*#.*nsys|nsys_folder|nsys_on' $(fd -a start_worker.sh)

Also applies to: 196-197, 214-215


165-178: Bound ctx_nodes slice and add capacity guard

Current code assigns ctx_nodes to the remainder of all_nodes, not the exact required count, and doesn’t fail fast if requested nodes exceed allocation.

Apply:

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
 
 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"

187-197: Remove invalid srun --segment and avoid overspecifying --ntasks-per-node; quote nodelist

--segment is not a valid Slurm option; this will fail immediately. Also, forcing --ntasks-per-node can overconstrain placement when tasks aren’t divisible. Since you already pass an explicit nodelist and total --ntasks, omit it (or compute it only when divisible).

Apply:

     srun -l -N ${gen_nodes_num} \
         --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --segment=${gen_nodes_num} \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
-        --nodelist=$(IFS=,; echo "${node_list[*]}") \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
         --mpi=pmix \
         bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
         &> ${full_logdir}/output_gen_${i}.log &

If you must constrain per-node tasks, compute it only when divisible:

# before the loop
if (( gen_tp_size % gen_nodes_num == 0 )); then
    gen_tasks_per_node=$(( gen_tp_size / gen_nodes_num ))
    gen_ntpn=(--ntasks-per-node="${gen_tasks_per_node}")
else
    gen_ntpn=()
fi
# and add: "${gen_ntpn[@]}" after --ntasks

205-215: Mirror the GEN fixes: drop --segment, be cautious with --ntasks-per-node; quote nodelist

Same as above for CTX: remove invalid --segment, don’t overspecify --ntasks-per-node unless divisible, and quote the nodelist.

Apply:

     srun -l -N ${ctx_nodes_num} \
         --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --segment=${ctx_nodes_num} \
         --container-image=${container_image} \
         --container-name=${container_name} \
         --container-mounts=${mounts} \
-        --nodelist=$(IFS=,; echo "${node_list[*]}") \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
         --mpi=pmix \
         bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
             &> ${full_logdir}/output_ctx_${i}.log &
🧹 Nitpick comments (3)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3)

49-51: Derive tasks-per-node robustly from Slurm env (TASKS_PER_NODE may be the only one set)

SLURM_NTASKS_PER_NODE is not always exported; clusters commonly set SLURM_TASKS_PER_NODE (e.g., “4(x2)” or “2,2”). Parse it when available; fall back sanely.

Apply:

-# Get GPUs per node dynamically from SLURM
-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4}  # Default to 4 for GB200
+# Derive tasks per node from Slurm (prefer SLURM_TASKS_PER_NODE), else default.
+if [[ -n "${SLURM_TASKS_PER_NODE:-}" ]]; then
+  # Example formats: "4(x2)", "2,2", "1(x3),2" -> pick the first number
+  ntasks_per_node="$(sed -E 's/^([0-9]+).*/\1/' <<< "${SLURM_TASKS_PER_NODE}")"
+elif [[ -n "${SLURM_NTASKS_PER_NODE:-}" ]]; then
+  ntasks_per_node="${SLURM_NTASKS_PER_NODE}"
+else
+  ntasks_per_node=4  # Default for GB200
+fi
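
For clarity, this is what the first-number extraction yields on the common formats (sample values only):

for sample in "4(x2)" "2,2" "1(x3),2"; do
    echo "${sample} -> $(sed -E 's/^([0-9]+).*/\1/' <<< "${sample}")"
done
# 4(x2)   -> 4
# 2,2     -> 2
# 1(x3),2 -> 1   (heterogeneous layouts resolve to the first group's count)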

10-13: Ensure cleanup always runs: use a trap instead of a final scancel

With set -e, earlier failures skip the final scancel. Add a trap so the job gets cancelled on any error or signal; you can then drop the explicit scancel at the end.

Apply:

 set -u
 set -e
+set -o pipefail
 set -x
+cleanup() { scancel "${SLURM_JOB_ID}" 2>/dev/null || true; }
+trap cleanup EXIT INT TERM
@@
-scancel ${SLURM_JOB_ID}
+# cleanup handled by trap

Also applies to: 233-233


94-96: Unused variables (ctx_gpus/gen_gpus)

ctx_gpus and gen_gpus are computed but never used. Drop them or print for observability.

Apply:

-ctx_gpus=$((num_ctx_servers * ctx_tp_size))
-gen_gpus=$((num_gen_servers * gen_tp_size))
+# ctx_gpus=$((num_ctx_servers * ctx_tp_size))
+# gen_gpus=$((num_gen_servers * gen_tp_size))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85f2bb8 and c4b2da2.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)

146-153: No action needed: Flags correctly match parser definitions

The Slurm script is using the exact flag names defined in gen_worker_config.py:

  • --ctx_free_gpu_memory_fraction corresponds to parser.add_argument("--ctx_free_gpu_memory_fraction", …) (line 167)
  • --gen_gpu_memory_fraction corresponds to parser.add_argument("--gen_gpu_memory_fraction", …) (line 195)

There is no mismatch between the script and the generator’s expected arguments.

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)

129-134: Ensure pipefail is enabled to surface errors in piped commands

The check PIPEFAIL=off shows that pipefail is not active in the shell running this SLURM script, so any failures in the srun … | tee pipeline will be masked. We must explicitly enable it at the top of the script.

• File: examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
• Add near the top of the script, right after the shebang and before any piped commands:

- #!/bin/bash
+ #!/bin/bash
+ set -o pipefail

This change ensures that if pip install -e . (or any other stage) fails, the pipeline aborts with a non-zero exit code and CI will catch the error.
Applies likewise to the block on lines 136–159.
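
A quick generic demonstration of the difference (not specific to this script):

bash -c 'false | tee /dev/null; echo "exit without pipefail: $?"'                      # prints 0 — the failure is masked
bash -c 'set -o pipefail; false | tee /dev/null; echo "exit with pipefail: $?"'        # prints 1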

♻️ Duplicate comments (3)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3)

194-205: Mirror the GEN fix for CTX loop: don’t force --ntasks-per-node unless safe

Same rationale as the GEN loop: either omit or set conditionally when evenly divisible.

+ctx_ntasks_per_node_arg=()
+if (( ctx_tp_size % ctx_nodes_num == 0 )); then
+  ctx_ntasks_per_node_arg=(--ntasks-per-node=$(( ctx_tp_size / ctx_nodes_num )))
+fi
 srun -l -N ${ctx_nodes_num} \
-    --ntasks=${ctx_tp_size} \
-    --ntasks-per-node=${ntasks_per_node} \
+    --ntasks=${ctx_tp_size} "${ctx_ntasks_per_node_arg[@]}" \
     --container-image=${container_image} \

10-13: Enable pipefail so failures in piped commands (e.g., srun | tee) fail the script

Later you pipe to tee; without pipefail, upstream failures are ignored and the script continues, producing misleading “success.” Add pipefail here.

 set -u
 set -e
+set -o pipefail
 set -x

165-168: Slice the ctx_nodes to the exact requested count and add a capacity check

Right now ctx_nodes uses the remainder of all_nodes, which can overshoot the requested count and mask allocation mistakes. Add a guard and bound the slice.

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
 
 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes exceed allocation: gen=${total_gen_nodes_num} ctx=${total_ctx_nodes_num} total=${total_nodes_num}" >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"

Also applies to: 169-173, 174-178

🧹 Nitpick comments (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6)

49-51: Prefer SLURM_GPUS_ON_NODE as fallback when deriving ntasks_per_node

On GPU jobs, GPUS_ON_NODE is often the better signal than NTASKS_PER_NODE. Keep your current default but prefer GPUs when available.

-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4}  # Default to 4 for GB200
+ntasks_per_node="${SLURM_NTASKS_PER_NODE:-${SLURM_GPUS_ON_NODE:-4}}"  # Prefer GPUs-per-node when present; default 4 for GB200

162-164: Derivation of per-group node counts looks right; consider guarding for zero or negative values

If tp_size is 0 (misconfiguration), these could compute to 0 and later srun -N 0 will fail obscurely. A simple validation improves UX.

 ctx_nodes_num=$(((ctx_tp_size + ntasks_per_node - 1) / ntasks_per_node))
 gen_nodes_num=$(((gen_tp_size + ntasks_per_node - 1) / ntasks_per_node))
+if (( ctx_nodes_num <= 0 || gen_nodes_num <= 0 )); then
+    echo "Error: computed nodes per group (ctx=${ctx_nodes_num}, gen=${gen_nodes_num}) must be > 0. Check *_tp_size and ntasks_per_node." >&2
+    exit 1
+fi

179-180: Nit: leftover cleanup path is now unused elsewhere

If hostnames artifacts are no longer produced, consider removing this cleanup to avoid confusion.

-rm -rf ${full_logdir}/hostnames
+# Legacy cleanup (hostnames) removed; no longer produced

181-193: Avoid overspecifying --ntasks-per-node; let Slurm place tasks unless evenly divisible

You already compute -N and total --ntasks. For non-divisible cases, forcing --ntasks-per-node can overconstrain and fail placement. Either drop it or add it conditionally when divisible.

Option A (simpler): drop it

     srun -l -N ${gen_nodes_num} \
         --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
         --container-image=${container_image} \

Option B (conditional):

+gen_ntasks_per_node_arg=()
+if (( gen_tp_size % gen_nodes_num == 0 )); then
+  gen_ntasks_per_node_arg=(--ntasks-per-node=$(( gen_tp_size / gen_nodes_num )))
+fi
 srun -l -N ${gen_nodes_num} \
-    --ntasks=${gen_tp_size} \
-    --ntasks-per-node=${ntasks_per_node} \
+    --ntasks=${gen_tp_size} "${gen_ntasks_per_node_arg[@]}" \
     --container-image=${container_image} \

222-222: Be cautious with scancel inside the job; ensure all logs are flushed

Calling scancel on the current job is fine here but can truncate trailing logs. Consider a short wait or ensuring backgrounded sruns have terminated before scancel.

- scancel ${SLURM_JOB_ID}
+# Give backgrounded workers a moment to flush logs before cancellation (optional)
+sleep 2
+scancel ${SLURM_JOB_ID}

119-121: Standardize Nsys profiling directory usage

Change the profiling flag to a directory path (nsys_folder) and create it when set. This aligns with the 9th positional argument in start_worker.sh (verified: $9 → nsys_folder), and clarifies intent versus a boolean flag.

Files/locations to update:

  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm, lines 119–121
  • Same file, around the two start_worker.sh invocations at lines 181–193 and 194–205

Suggested diff:

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder="${full_logdir}/nsys"  # Uncomment to enable Nsys profiling and store traces under this directory
+if [ -n "${nsys_folder}" ]; then
+    mkdir -p "${nsys_folder}"
+fi
@@ (around start_worker.sh invocations)
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4b2da2 and 19e8c55.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

87-88: LGTM: container_name/logdir reorg improves clarity and log scoping

Consistent container identity and per-job log roots are helpful for debugging. No action needed.


114-117: LGTM: safe export of TRT_LLM_GIT_COMMIT with fallback

The short hash export with graceful fallback to “unknown” is robust. No action needed.


209-213: LGTM: server startup separated and parameterized via start_server.sh

Clear separation and reuse. No action needed.


217-220: LGTM: benchmark step parameterization and logging target are clear

Straightforward pass-through to run_benchmark.sh with logs centralized. No action needed.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6)

10-12: Add pipefail and make tracing optional to avoid masking failures and noisy logs

  • Without pipefail, failures in pipelines (e.g., srun|tee, python|tee) won’t stop the script.
  • Consider gating set -x behind an env flag to keep logs readable.

Apply:

 set -u
 set -e
-set -x
+set -o pipefail
+# Enable xtrace only when TRACE=1
+[ "${TRACE:-0}" = "1" ] && set -x

119-121: Unify Nsys variable naming with start_worker.sh/start_server.sh

This script passes ${nsys_on} to workers, but previous iterations and other scripts often use nsys_folder. Mismatched names will silently disable profiling.

  • Confirm what start_worker.sh expects (nsys_on vs nsys_folder).
  • Standardize on one name and create the folder when enabled.

Suggested:

-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys  # Uncomment to enable Nsys profiling
+# [ -n "${nsys_folder}" ] && mkdir -p "${nsys_folder}"

Then update both worker launches to pass ${nsys_folder}.

To verify across the repo:

#!/bin/bash
# Verify argument naming in worker/server scripts
rg -nP 'nsys_(on|folder)' examples/disaggregated/slurm/benchmark/start_worker.sh
rg -nP 'nsys_(on|folder)' examples/disaggregated/slurm/benchmark/start_server.sh || true

136-159: YAML generation is piped to tee: without pipefail, failures won’t abort

Since this pipeline uses | tee, add set -o pipefail (see earlier comment) so gen failures stop the job.

Also, double-check the option names align with gen_worker_config.py:

  • You pass --ctx_free_gpu_memory_fraction but --gen_gpu_memory_fraction; verify the intended naming asymmetry.

To confirm option names:

#!/bin/bash
rg -nP 'argparse|add_argument\(' examples/disaggregated/slurm/benchmark/gen_worker_config.py -n -C2

169-178: Add capacity guard and slice ctx_nodes to the exact required count

Right now ctx_nodes uses the remainder of all nodes instead of the exact number needed, and there’s no check that requested nodes fit the allocation.

Apply:

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
 
 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"

181-192: Critical: GEN worker srun lacks per-group nodelist and overspecifies --ntasks-per-node

  • You compute gen_nodes but don’t assign per-group slices to each srun, so simultaneous sruns can contend for the same nodes.
  • Passing --ntasks-per-node=${ntasks_per_node} unconditionally can make --ntasks inconsistent (e.g., gen_tp_size=6, nodes=2 → requires 12 tasks). This will fail or overconstrain placement.

Apply:

-# start the gen workers
-for i in $(seq 0 $((num_gen_servers - 1))); do
-    srun -l -N ${gen_nodes_num} \
-        --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
-        &> ${full_logdir}/output_gen_${i}.log &
-done
+# start the gen workers
+for i in $(seq 0 $((num_gen_servers - 1))); do
+    start=$(( i * gen_nodes_num ))
+    node_list=("${gen_nodes[@]:${start}:${gen_nodes_num}}")
+    # Only constrain tasks-per-node when evenly divisible
+    if (( gen_tp_size % gen_nodes_num == 0 )); then
+        gen_tasks_per_node=$(( gen_tp_size / gen_nodes_num ))
+        gen_ntasks_per_node_arg=(--ntasks-per-node="${gen_tasks_per_node}")
+    else
+        gen_ntasks_per_node_arg=()
+    fi
+    srun -l -N "${gen_nodes_num}" \
+        --ntasks="${gen_tp_size}" \
+        "${gen_ntasks_per_node_arg[@]}" \
+        --container-image="${container_image}" \
+        --container-name="${container_name}_gen_${i}" \
+        --container-mounts="${mounts}" \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
+        --mpi=pmix \
+        bash "${workdir}/start_worker.sh" "GEN" "${i}" "${model_dir}" "8336" "${benchmark_mode}" "${concurrency}" "${enable_pdl}" "${full_logdir}" "${nsys_on}" \
+        &> "${full_logdir}/output_gen_${i}.log" &
+done

195-205: Critical: CTX worker srun has the same issues as GEN (missing nodelist, overconstrained tasks-per-node)

Mirror the GEN fix for CTX.

Apply:

-# start the ctx workers
-for i in $(seq 0 $((num_ctx_servers - 1))); do
-    srun -l -N ${ctx_nodes_num} \
-        --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${ntasks_per_node} \
-        --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
-            &> ${full_logdir}/output_ctx_${i}.log &
-done
+# start the ctx workers
+for i in $(seq 0 $((num_ctx_servers - 1))); do
+    start=$(( i * ctx_nodes_num ))
+    node_list=("${ctx_nodes[@]:${start}:${ctx_nodes_num}}")
+    if (( ctx_tp_size % ctx_nodes_num == 0 )); then
+        ctx_tasks_per_node=$(( ctx_tp_size / ctx_nodes_num ))
+        ctx_ntasks_per_node_arg=(--ntasks-per-node="${ctx_tasks_per_node}")
+    else
+        ctx_ntasks_per_node_arg=()
+    fi
+    srun -l -N "${ctx_nodes_num}" \
+        --ntasks="${ctx_tp_size}" \
+        "${ctx_ntasks_per_node_arg[@]}" \
+        --container-image="${container_image}" \
+        --container-name="${container_name}_ctx_${i}" \
+        --container-mounts="${mounts}" \
+        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
+        --mpi=pmix \
+        bash "${workdir}/start_worker.sh" "CTX" "${i}" "${model_dir}" "8336" "${benchmark_mode}" "${concurrency}" "${enable_pdl}" "${full_logdir}" "${nsys_on}" \
+        &> "${full_logdir}/output_ctx_${i}.log" &
+done
🧹 Nitpick comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

49-50: Deriving per-node capacity from SLURM: verify it matches the intended scheduling unit

You later treat ntasks_per_node as the capacity to spread TP across nodes. Ensure this equals your “tasks per node” for each worker type (often GPUs per node). If the cluster exposes GPUs via SLURM_GPUS_ON_NODE, prefer that or fall back to SLURM_NTASKS_PER_NODE.

Option:

-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4}  # Default to 4 for GB200
+# Prefer GPUs per node if exported, else tasks-per-node, else 4
+ntasks_per_node=${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-4}}

If GEN/CTX use different per-node capacities, you may need separate variables.


162-168: Node count derivation: ensure the divisor matches per-node capacity and handles edge-cases

ctx_nodes_num/gen_nodes_num are computed via ceil-divide by ntasks_per_node. This is fine if ntasks_per_node is the true per-node capacity for each worker group; otherwise placement will be wrong.

If capacities differ per group, derive them separately (e.g., ctx_tasks_per_node, gen_tasks_per_node), and guard against zero or negative inputs if the user passes 0 sizes; a sketch follows below.
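
A minimal sketch of such a per-group derivation with basic guards (the ctx_tasks_per_node/gen_tasks_per_node names are illustrative, not taken from the script):

# Illustrative per-group capacities; fall back to the shared value when unset.
ctx_tasks_per_node=${ctx_tasks_per_node:-${ntasks_per_node}}
gen_tasks_per_node=${gen_tasks_per_node:-${ntasks_per_node}}
# Guard zero/negative sizes before the ceiling division.
if (( ctx_tp_size <= 0 || gen_tp_size <= 0 )); then
    echo "Error: tp sizes must be positive (ctx=${ctx_tp_size}, gen=${gen_tp_size})" >&2
    exit 1
fi
# Ceiling division: nodes needed to host tp_size ranks at the per-node capacity.
ctx_nodes_num=$(( (ctx_tp_size + ctx_tasks_per_node - 1) / ctx_tasks_per_node ))
gen_nodes_num=$(( (gen_tp_size + gen_tasks_per_node - 1) / gen_tasks_per_node ))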


216-220: Nit: pass container-image for benchmark srun for consistency

All other sruns specify --container-image; this one relies on the named container only. Add it so behavior stays consistent across environments.

 srun -l --container-name=${container_name} \
-    --container-mounts=${mounts} \
+    --container-image=${container_image} \
+    --container-mounts=${mounts} \
     --mpi=pmix --overlap -N 1 -n 1 \

222-222: Prefer trap-based cleanup to ensure workers/servers are torn down on early failures

If any stage fails (with set -e), the final scancel won’t run. A trap guarantees cleanup.

+# Ensure cleanup even on early exit
+trap 'scancel "${SLURM_JOB_ID}"' EXIT
-
-scancel ${SLURM_JOB_ID}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 19e8c55 and 24453fb.

📒 Files selected for processing (1)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)

47-48: LGTM: optional install path is clean

Defaulting trtllm_repo to empty string and guarding the install step with -n prevents accidental installs. Good.


87-88: LGTM: normalized container name and structured log directories

Using a stable container name and job-scoped log roots keeps artifacts organized.


114-118: LGTM: commit id discovery is resilient

The fallback to “unknown” avoids set -e exits when trtllm_repo is unset. Good.
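
For readers skimming the thread, the pattern amounts to roughly the following (a hedged sketch; the trtllm_commit variable name is illustrative, not the script's exact line):

# Never let a failed git call abort the job under `set -e`; fall back to "unknown".
if [ -n "${trtllm_repo:-}" ]; then
    trtllm_commit=$(git -C "${trtllm_repo}" rev-parse --short HEAD 2>/dev/null || echo "unknown")
else
    trtllm_commit="unknown"
fi
echo "TensorRT-LLM commit: ${trtllm_commit}"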


207-214: LGTM: server launch looks consistent

Server start uses a single task on one node and passes ${workdir} correctly.

Signed-off-by: Kaiyu Xie <[email protected]>

Update

Signed-off-by: Kaiyu Xie <[email protected]>

Update

Signed-off-by: Kaiyu Xie <[email protected]>

Fix

Signed-off-by: Kaiyu Xie <[email protected]>

Update

Signed-off-by: Kaiyu Xie <[email protected]>

Update

Signed-off-by: Kaiyu Xie <[email protected]>

Update documents

Signed-off-by: Kaiyu Xie <[email protected]>

Better error handling

Signed-off-by: Kaiyu Xie <[email protected]>

Fix

Signed-off-by: Kaiyu Xie <[email protected]>

Fix

Signed-off-by: Kaiyu Xie <[email protected]>

Fix

Signed-off-by: Kaiyu Xie <[email protected]>

Update

Signed-off-by: Kaiyu Xie <[email protected]>

Update

Signed-off-by: Kaiyu Xie <[email protected]>

Minor fix

Signed-off-by: Kaiyu Xie <[email protected]>
@kaiyux kaiyux force-pushed the user/kaiyu/compat_disagg_usage branch from 24453fb to 0eac701 Compare August 27, 2025 08:39
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)

62-76: Health check ignores HTTP status; check for 200

The current loop exits as soon as curl succeeds, regardless of the HTTP status code (even 5xx). Require an explicit HTTP 200.

-# check server is health by curl every 10 seconds timeout 1800 seconds
+# check server health: expect HTTP 200 within timeout
 timeout=1800
 start_time=$(date +%s)
-while ! curl -s -o /dev/null -w "%{http_code}" http://${hostname}:${port}/health; do
+while true; do
+    status=$(curl -s -o /dev/null -w "%{http_code}" "http://${hostname}:${port}/health" || echo "000")
+    if [ "${status}" = "200" ]; then
+        break
+    fi
     current_time=$(date +%s)
     elapsed=$((current_time - start_time))
     if [ $elapsed -ge $timeout ]; then
-        echo "Error: Server is not healthy after ${timeout} seconds"
+        echo "Error: Server is not healthy after ${timeout} seconds (last http_code=${status})"
         exit 1
     fi
     if [ $((elapsed % 30)) -eq 0 ]; then
-        echo "Waiting for server to be healthy... (${elapsed}s elapsed)"
+        echo "Waiting for server to be healthy... (${elapsed}s elapsed, http_code=${status})"
     fi
     sleep 10
 done
♻️ Duplicate comments (17)
examples/disaggregated/slurm/benchmark/gen_server_config.py (4)

1-7: Add NVIDIA 2025 header and use logging instead of print

Required by repo guidelines; also switch to logging for controllable verbosity.

+#!/usr/bin/env python3
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import os
 import socket
 import time
+import logging
 
 import yaml

8-30: Introduce timeout and log-level flags; initialize logging early

Prevents indefinite hangs and allows verbosity control.

 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument("--num_ctx_servers",
                         type=int,
                         required=True,
                         help="Number of context servers")
     parser.add_argument("--num_gen_servers",
                         type=int,
                         required=True,
                         help="Number of generation servers")
     parser.add_argument("--work_dir",
                         type=str,
                         default="logs",
                         help="Work directory")
     parser.add_argument("--worker_port",
                         type=int,
                         default=8336,
                         help="Worker port")
     parser.add_argument("--server_port",
                         type=int,
                         default=8333,
                         help="Server port")
+    parser.add_argument("--wait_timeout_sec",
+                        type=int,
+                        default=600,
+                        help="Max seconds to wait for hostname files before failing")
+    parser.add_argument("--log_level",
+                        type=str,
+                        default="INFO",
+                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+                        help="Logging level")
     args = parser.parse_args()
+
+    logging.basicConfig(
+        level=getattr(logging, args.log_level.upper(), logging.INFO),
+        format="%(asctime)s %(levelname)s %(message)s",
+    )
+    logger = logging.getLogger(__name__)

36-49: Avoid infinite wait; verify CTX/GEN counts separately; fix long line (Ruff E501)

Use a deadline and separate CTX/GEN file checks; replace prints with logging.

-    #check all of the hostnames in the hostnames folder exists, if not, sleep 10 seconds and check again
-    hostnames_folder = os.path.join(args.work_dir, "hostnames")
-    while not os.path.exists(hostnames_folder):
-        time.sleep(10)
-        print(f"Waiting for hostnames folder {hostnames_folder} to be found")
-    hostnames = os.listdir(hostnames_folder)
-    # check length of hostnames is equal to num_ctx_servers + num_gen_servers, if not, sleep 10 seconds and check again
-    while len(hostnames) != args.num_ctx_servers + args.num_gen_servers:
-        time.sleep(10)
-        hostnames = os.listdir(hostnames_folder)
-        print(
-            f"Waiting for hostnames to be found in {hostnames_folder}, current length: {len(hostnames)}, expected length: {args.num_ctx_servers + args.num_gen_servers}"
-        )
-    print(f"All hostnames found in {hostnames_folder}")
+    hostnames_folder = os.path.join(args.work_dir, "hostnames")
+    deadline = time.monotonic() + args.wait_timeout_sec
+    while True:
+        if os.path.exists(hostnames_folder):
+            all_files = os.listdir(hostnames_folder)
+            ctx_files = sorted(f for f in all_files if f.startswith("CTX"))
+            gen_files = sorted(f for f in all_files if f.startswith("GEN"))
+            if (len(ctx_files) == args.num_ctx_servers and
+                    len(gen_files) == args.num_gen_servers):
+                break
+            logger.info("Waiting for hostnames in %s: CTX=%d/%d, GEN=%d/%d",
+                        hostnames_folder, len(ctx_files), args.num_ctx_servers,
+                        len(gen_files), args.num_gen_servers)
+        else:
+            logger.info("Waiting for hostnames folder %s to be created ...",
+                        hostnames_folder)
+        if time.monotonic() >= deadline:
+            raise TimeoutError(
+                f"Timed out after {args.wait_timeout_sec}s waiting for "
+                f"hostname files in {hostnames_folder}"
+            )
+        time.sleep(10)
+    logger.info("All required hostname files found in %s", hostnames_folder)

51-64: Read CTX/GEN files deterministically; dedupe; use explicit encoding

Avoids stray files and duplicate entries.

-    # get the ctx and gen hostnames from the hostnames file
-    ctx_hostnames = []
-    gen_hostnames = []
-    for hostname_file in hostnames:
-        hostname_file_path = os.path.join(hostnames_folder, hostname_file)
-        with open(hostname_file_path, 'r') as f:
-            actual_hostname = f.read().strip()
-            print(f"Hostname: {actual_hostname} in {hostname_file}")
-
-        if hostname_file.startswith("CTX"):
-            ctx_hostnames.append(actual_hostname)
-        elif hostname_file.startswith("GEN"):
-            gen_hostnames.append(actual_hostname)
+    # Read hostnames
+    ctx_hostnames = []
+    for hostname_file in ctx_files:
+        with open(os.path.join(hostnames_folder, hostname_file), "r", encoding="utf-8") as f:
+            ctx_hostnames.append(f.read().strip())
+    gen_hostnames = []
+    for hostname_file in gen_files:
+        with open(os.path.join(hostnames_folder, hostname_file), "r", encoding="utf-8") as f:
+            gen_hostnames.append(f.read().strip())
+    # Deduplicate while preserving order
+    ctx_hostnames = list(dict.fromkeys(ctx_hostnames))
+    gen_hostnames = list(dict.fromkeys(gen_hostnames))
examples/disaggregated/slurm/benchmark/gen_worker_config.py (4)

1-4: Add NVIDIA 2025 header (and optional shebang)

Required header is missing.

+#!/usr/bin/env python3
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import os

24-24: Align default cache_transceiver_max_num_tokens with CLI (8448)

Prevents surprising behavior for library callers.

-                    cache_transceiver_max_num_tokens: int = 4608) -> None:
+                    cache_transceiver_max_num_tokens: int = 8448) -> None:

25-49: Docstring out of sync with signature; replace with accurate Google-style docstring

Current docstring lists wrong/missing params.

-    """
-    Generate configuration YAML file for disaggregated inference.
-
-    Args:
-        config_path: Path to save the config file
-        model_path: Path to the model
-        num_ctx_servers: Number of context servers
-        ctx_tp_size: Tensor parallel size for context servers
-        ctx_pp_size: Pipeline parallel size for context servers
-        ctx_batch_size: Batch size for context servers
-        ctx_max_num_tokens: Max number of tokens for context servers
-        ctx_max_seq_len: Max sequence length for context servers
-        ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers
-        ctx_enable_attention_dp: Enable attention DP for context servers
-        num_gen_servers: Number of generation servers
-        gen_tp_size: Tensor parallel size for generation servers
-        gen_pp_size: Pipeline parallel size for generation servers
-        gen_batch_size: Batch size for generation servers
-        gen_max_num_tokens: Max number of tokens for generation servers
-        gen_enable_attention_dp: Enable attention DP for generation servers
-        gen_gpu_memory_fraction: GPU memory fraction for generation servers
-        eplb_num_slots: Number of slots for eplb
-        worker_start_port: Start port for workers
-        server_port: Server port
-    """
+    """
+    Generate per-role YAML configs for disaggregated inference workers.
+
+    Writes two files into work_dir:
+      - ctx_config.yaml (context servers)
+      - gen_config.yaml (generation servers)
+
+    Args:
+        work_dir: Directory for output YAML files.
+        ctx_tp_size: TP size for context servers.
+        ctx_pp_size: PP size for context servers.
+        ctx_batch_size: Max batch size for context servers.
+        ctx_max_num_tokens: Max tokens per batch for context servers.
+        ctx_max_seq_len: Max sequence length for context servers.
+        ctx_free_gpu_memory_fraction: Fraction of GPU memory to keep free (CTX).
+        ctx_enable_attention_dp: Enable attention DP for CTX.
+        gen_tp_size: TP size for generation servers.
+        gen_pp_size: PP size for generation servers.
+        gen_batch_size: Max batch size for generation servers.
+        gen_max_num_tokens: Max tokens per batch for generation servers.
+        gen_max_seq_len: Max sequence length for generation servers.
+        gen_enable_attention_dp: Enable attention DP for GEN.
+        gen_gpu_memory_fraction: Fraction of GPU memory to keep free (GEN).
+        eplb_num_slots: Expert-parallel load balancer slots (0 disables).
+        mtp_size: If >0, enable MTP speculative decoding with this many layers.
+        cache_transceiver_max_num_tokens: Max tokens buffered by transceiver.
+    """

136-137: Ensure work_dir exists before writing files

Prevents FileNotFoundError.

-    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
-    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
+    os.makedirs(work_dir, exist_ok=True)
+    ctx_config_path = os.path.join(work_dir, "ctx_config.yaml")
+    gen_config_path = os.path.join(work_dir, "gen_config.yaml")
examples/disaggregated/slurm/benchmark/start_worker.sh (6)

6-14: Validate arguments early with a usage message

Avoids undefined behavior when too few args are passed.

+if [ "$#" -lt 9 ]; then
+  echo "Usage: $0 <role:CTX|GEN> <instance_id> <model_path> <port> <benchmark_mode:e2e|gen_only> <concurrency> <enable_pdl:true|false> <work_dir> <nsys_folder|empty>" >&2
+  exit 2
+fi
 role=$1
 instance_id=$2
 model_path=$3

1-5: Fix shebang and enable strict mode (+pipefail, safe IFS)

Improves robustness and portability.

-#! /bin/bash
-set -u
-set -e
-set -x
+#!/usr/bin/env bash
+set -euo pipefail
+set -x
+IFS=$'\n\t'

44-48: Guard SLURM_NODEID and quote expansions

Prevents unbound var errors and word splitting.

-if [ "${SLURM_NODEID}" = "0" ]; then
-    mkdir -p ${work_dir}/hostnames/
-    echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt
-    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
+if [ "${SLURM_NODEID:-}" = "0" ]; then
+    mkdir -p "${work_dir}/hostnames/"
+    echo "$(hostname)" > "${work_dir}/hostnames/${role}_${instance_id}.txt"
+    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
 fi

51-54: Quote command arguments in the non-NSYS path

Avoids word splitting and globbing.

-    trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file}
+    trtllm-llmapi-launch trtllm-serve "${model_path}" --host "$(hostname)" --port "${port}" --extra_llm_api_options "${config_file}"

55-70: Build NSYS prefix as an array; quote everything

Fixes SC2089/SC2090 and makes quoting robust.

-    nsys_prefix=""
-    nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}
+    nsys_args=()
+    nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}"
@@
-        nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none"
-        echo "nsys_prefix: ${nsys_prefix}"
+        nsys_args=(nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none)
+        echo "nsys_args: ${nsys_args[*]}"
@@
-    trtllm-llmapi-launch ${nsys_prefix} \
-        trtllm-serve ${model_path} \
-            --host $(hostname) --port ${port} \
-            --extra_llm_api_options ${config_file}
+    trtllm-llmapi-launch "${nsys_args[@]}" \
+        trtllm-serve "${model_path}" \
+            --host "$(hostname)" --port "${port}" \
+            --extra_llm_api_options "${config_file}"

31-40: Check config file exists after derivation

Fail fast if the YAML is missing.

 echo "config_file: ${config_file}"
+if [ ! -f "${config_file}" ]; then
+    echo "Config file not found: ${config_file}" >&2
+    exit 1
+fi
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)

10-12: Enable pipefail

Prevents pipeline failures from being masked (e.g., with tee).

 set -u
 set -e
+set -o pipefail
 set -x

171-183: Bound ctx_nodes slice and add capacity check

Avoids over-slicing and undefined behavior when requested nodes exceed allocation.

 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
 
 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
 
 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
examples/disaggregated/slurm/benchmark/README.md (1)

126-129: Fix startup order (server before workers) or document worker retry semantics.

Current steps launch workers before the server, which can fail without retry/wait logic. Recommend flipping steps 6 and 7 for a safer default. If you keep workers-first, please document and implement worker-side retries.

-5.  `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files.
-6.  `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers for both context and generation phases.
-7.  `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration using `gen_server_config.py`.
+5.  `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files.
+6.  `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration via `gen_server_config.py`.
+7.  `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers for both context and generation phases.

If you prefer workers-first, add a note after step 7: “Workers retry until the server is reachable (e.g., polling /health).”
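
A minimal sketch of such a worker-side wait, assuming the worker can discover the server's hostname and port (8333 is the default server port used elsewhere in these scripts; the variable names are illustrative):

# Poll the server's /health endpoint before the worker proceeds; curl -f treats HTTP >= 400 as failure.
server_host=${server_host:-localhost}
server_port=${server_port:-8333}
until curl -sf "http://${server_host}:${server_port}/health" > /dev/null; do
    echo "Waiting for server at ${server_host}:${server_port} ..."
    sleep 10
done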

🧹 Nitpick comments (12)
examples/disaggregated/slurm/benchmark/gen_server_config.py (3)

32-35: Exit cleanly on missing work_dir

Raise SystemExit for a CLI tool; current ValueError produces a traceback.

-    if not os.path.exists(args.work_dir):
-        raise ValueError(f"Work directory {args.work_dir} not found")
+    if not os.path.exists(args.work_dir):
+        raise SystemExit(f"Work directory {args.work_dir} not found")

65-71: Replace prints with logging for hostname and current host

Keeps logs consistent.

-    print(f"ctx_hostnames: {ctx_hostnames}")
-    print(f"gen_hostnames: {gen_hostnames}")
+    logger.debug("ctx_hostnames: %s", ctx_hostnames)
+    logger.debug("gen_hostnames: %s", gen_hostnames)
@@
-    print(f"Current hostname: {hostname}")
+    logger.info("Current hostname: %s", hostname)

86-90: Use UTF-8 encoding and quiet YAML formatting

Encoding avoids locale issues; sort_keys=False preserves order; replace print with logging.

-    with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f:
-        yaml.dump(server_config, f)
-    print(
-        f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated"
-    )
+    out_path = os.path.join(args.work_dir, "server_config.yaml")
+    with open(out_path, "w", encoding="utf-8") as f:
+        yaml.dump(server_config, f, default_flow_style=False, sort_keys=False)
+    logger.info("Server config file generated at %s", out_path)
examples/disaggregated/slurm/benchmark/gen_worker_config.py (1)

119-124: Open YAML with UTF-8 and consistent dumper options

Encoding prevents locale issues.

-        with open(moe_load_balancer_file, "w") as f:
-            yaml.dump(moe_load_balancer_config,
-                      f,
-                      default_flow_style=False,
-                      sort_keys=False)
+        with open(moe_load_balancer_file, "w", encoding="utf-8") as f:
+            yaml.dump(moe_load_balancer_config, f, default_flow_style=False, sort_keys=False)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)

187-211: Consider constraining srun to explicit node lists

You compute gen_nodes/ctx_nodes but do not use them; concurrent sruns may contend on the same nodes. Add --nodelist per group if segregation is required.

Would you like me to propose a per-loop nodelist assignment (partition gen_nodes/ctx_nodes slices per server index)?
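
If helpful, a per-loop assignment could look roughly like the sketch below; it reuses the gen_nodes array and gen_nodes_num that the script already computes (the slice/nodelist names are illustrative). The CTX loop would mirror this over ctx_nodes.

# Sketch: give each GEN server its own contiguous slice of gen_nodes.
for i in $(seq 0 $((num_gen_servers - 1))); do
    start=$(( i * gen_nodes_num ))
    slice=("${gen_nodes[@]:${start}:${gen_nodes_num}}")
    nodelist=$(IFS=,; echo "${slice[*]}")
    # ...existing srun arguments for this group, plus:
    #     --nodelist="${nodelist}" \
done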

examples/disaggregated/slurm/benchmark/run_benchmark.sh (4)

28-44: Quote config path in wait loop

Avoids word-splitting/globbing edge cases.

-while [ ! -f ${config_file} ]; do
+while [ ! -f "${config_file}" ]; do

46-54: Anchor YAML field extraction and quote variables

Reduces false matches and parsing errors.

-hostname=$(grep -i "hostname:" ${config_file} | awk '{print $2}')
-port=$(grep -i "port:" ${config_file} | awk '{print $2}')
+hostname=$(grep -iE '^\s*hostname:\s*' "${config_file}" | awk '{print $2}')
+port=$(grep -iE '^\s*port:\s*' "${config_file}" | awk '{print $2}')
 if [ -z "$hostname" ] || [ -z "$port" ]; then

80-97: Quote glob loops and outputs in do_get_logs

Prevents word splitting and handles no-match safely.

-    for gen_file in ${log_path}/output_gen_*.log; do
-        if [ -f "$gen_file" ]; then
+    shopt -s nullglob
+    for gen_file in "${log_path}"/output_gen_*.log; do
+        if [ -f "${gen_file}" ]; then
             index=$(basename "$gen_file" | sed 's/output_gen_\(.*\)\.log/\1/')
-            grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "$gen_file" > "${output_folder}/gen_only_${index}.txt" || true
+            grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "${gen_file}" > "${output_folder}/gen_only_${index}.txt" || true
         fi
     done
 
-    for ctx_file in ${log_path}/output_ctx_*.log; do
-        if [ -f "$ctx_file" ]; then
+    for ctx_file in "${log_path}"/output_ctx_*.log; do
+        if [ -f "${ctx_file}" ]; then
             index=$(basename "$ctx_file" | sed 's/output_ctx_\(.*\)\.log/\1/')
-            grep -a "'num_generation_tokens': 0" "$ctx_file" > "${output_folder}/ctx_only_${index}.txt" || true
+            grep -a "'num_generation_tokens': 0" "${ctx_file}" > "${output_folder}/ctx_only_${index}.txt" || true
         fi
     done

99-123: Quote paths and args during runs and log aggregation

Minor robustness improvements.

-    mkdir -p ${log_path}/concurrency_${concurrency}
+    mkdir -p "${log_path}/concurrency_${concurrency}"
@@
-        --dataset-path ${shared_gpt_path} \
+        --dataset-path "${shared_gpt_path}" \
@@
-    do_get_logs ${log_path} ${log_path}/concurrency_${concurrency}
-    echo "done for ${concurrency} in folder ${log_path}/concurrency_${concurrency}"
+    do_get_logs "${log_path}" "${log_path}/concurrency_${concurrency}"
+    echo "done for ${concurrency} in folder ${log_path}/concurrency_${concurrency}"
examples/disaggregated/slurm/benchmark/README.md (3)

10-10: Align description with server-first startup order.

Clarify the sequence explicitly to avoid race conditions.

-2.  `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates configuration files, starts the server and workers, and runs the benchmark client.
+2.  `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates configuration files, starts the server (via `start_server.sh`), then launches workers (via `start_worker.sh`), and runs the benchmark client.

14-14: Grammar nit: add article.

-6.  `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine.
+6.  `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.

63-70: Call out deprecation/replacement to prevent confusion.

Explicitly note this replaces the previous single-config generator to guide users migrating from older docs/scripts.

 This Python script generates the worker configuration YAML file that configures the `trtllm-serve` workers. It creates separate configurations for context and generation workers with different tensor parallelism, batch sizes, and other parameters.
+It replaces the previous `gen_yaml.py`-based workflow.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 24453fb and 0eac701.

📒 Files selected for processing (8)
  • examples/disaggregated/slurm/benchmark/README.md (3 hunks)
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
  • examples/disaggregated/slurm/benchmark/gen_server_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
  • examples/disaggregated/slurm/benchmark/gen_yaml.py (0 hunks)
  • examples/disaggregated/slurm/benchmark/run_benchmark.sh (4 hunks)
  • examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
  • examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
💤 Files with no reviewable changes (1)
  • examples/disaggregated/slurm/benchmark/gen_yaml.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/disaggregated/slurm/benchmark/start_server.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • examples/disaggregated/slurm/benchmark/gen_server_config.py
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • examples/disaggregated/slurm/benchmark/gen_server_config.py
  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
🧠 Learnings (1)
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • examples/disaggregated/slurm/benchmark/gen_worker_config.py
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/gen_server_config.py

47-47: Line too long (168 > 120)

(E501)

🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh

[warning] 46-46: Quote this to prevent word splitting.

(SC2046)


[warning] 53-53: Quote this to prevent word splitting.

(SC2046)


[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.

(SC2089)


[warning] 66-66: Quotes/backslashes in this variable will not be respected.

(SC2090)


[warning] 68-68: Quote this to prevent word splitting.

(SC2046)

🪛 LanguageTool
examples/disaggregated/slurm/benchmark/README.md

[grammar] ~14-~14: There might be a mistake here.
Context: ...ver.sh`: A shell script responsible for starting disaggregated server using `trtllm-serv...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...trtllm-serve` on each allocated machine. 7. `run_benchmark.sh`: A shell script that waits for the serv...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Kaiyu Xie <[email protected]>
@kaiyux
Copy link
Member Author

kaiyux commented Aug 27, 2025

/bot skip --comment "the example slurm scripts are not protected by CI pipeline"

@kaiyux kaiyux enabled auto-merge (squash) August 27, 2025 10:13
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16676 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16676 [ skip ] completed with state SUCCESS
Skipping testing for commit af7dde0

@kaiyux kaiyux merged commit 8a619be into NVIDIA:main Aug 27, 2025
5 checks passed
@kaiyux kaiyux deleted the user/kaiyu/compat_disagg_usage branch August 27, 2025 16:16
@coderabbitai coderabbitai bot mentioned this pull request Sep 1, 2025
1 task
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants