[None] [chore] Make disagg example compatible with recommended usage #7121
Conversation
📝 Walkthrough
Replaces the monolithic SLURM YAML workflow with per-role worker/server config generators and orchestration: adds strict shell modes, per-group node assignment and logs, and per-worker NSYS profiling support.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant SLURM as disaggr_torch.slurm
participant GEN as GEN workers (srun)
participant CTX as CTX workers (srun)
participant START_SRV as start_server.sh
participant GEN_SRV_CFG as gen_server_config.py
participant SERVER as trtllm-serve
participant BENCH as run_benchmark.sh
SLURM->>SLURM: enumerate nodes, compute ntasks_per_node\nsplit nodes into gen_nodes / ctx_nodes
SLURM->>GEN: srun -> start_worker.sh "GEN" index ... work_dir
SLURM->>CTX: srun -> start_worker.sh "CTX" index ... work_dir
GEN->>GEN: write hostname file (GEN_index)\nlaunch trtllm-serve via llmapi-launch (optional NSYS)
CTX->>CTX: write hostname file (CTX_index)\nlaunch trtllm-serve via llmapi-launch
SLURM->>START_SRV: start_server.sh num_ctx num_gen work_dir script_dir
START_SRV->>GEN_SRV_CFG: poll work_dir/hostnames/*\nassemble server_config.yaml
GEN_SRV_CFG->>START_SRV: server_config.yaml ready
START_SRV->>SERVER: trtllm-serve -c server_config.yaml
SLURM->>BENCH: run_benchmark.sh ${logdir}
BENCH->>SERVER: wait for /health OK\nrun workload
BENCH->>BENCH: collect per-worker logs (output_gen_*.log / output_ctx_*.log)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)
64-78
: Health check loop is logically incorrect; it won’t wait on non-200 responses
Using “while ! curl …” only checks curl’s exit status, not the HTTP code. 4xx/5xx will pass immediately. Compare the code to 200 explicitly.
-# check server is health by curl every 10 seconds timeout 1800 seconds
-timeout=1800
-start_time=$(date +%s)
-while ! curl -s -o /dev/null -w "%{http_code}" http://${hostname}:${port}/health; do
-    current_time=$(date +%s)
-    elapsed=$((current_time - start_time))
-    if [ $elapsed -ge $timeout ]; then
-        echo "Error: Server is not healthy after ${timeout} seconds"
-        exit 1
-    fi
-    if [ $((elapsed % 30)) -eq 0 ]; then
-        echo "Waiting for server to be healthy... (${elapsed}s elapsed)"
-    fi
-    sleep 10
-done
+# Check server health every 10s, timeout 1800s
+timeout=1800
+start_time=$(date +%s)
+while true; do
+    http_code=$(curl -s -o /dev/null -w "%{http_code}" "http://${hostname}:${port}/health" || echo "000")
+    if [ "${http_code}" = "200" ]; then
+        break
+    fi
+    current_time=$(date +%s)
+    elapsed=$((current_time - start_time))
+    if [ $elapsed -ge $timeout ]; then
+        echo "Error: Server is not healthy after ${timeout} seconds (last code=${http_code})"
+        exit 1
+    fi
+    if [ $((elapsed % 30)) -eq 0 ]; then
+        echo "Waiting for server to be healthy... (${elapsed}s elapsed, last code=${http_code})"
+    fi
+    sleep 10
+done

examples/disaggregated/slurm/benchmark/gen_yaml.py (2)
7-23
: Unify default for cache_transceiver_max_num_tokens with CLI (8448)
Function default is 4608 while CLI default is 8448. Aligning avoids surprises for direct callers.
-    cache_transceiver_max_num_tokens: int = 4608) -> None:
+    cache_transceiver_max_num_tokens: int = 8448) -> None:
23-45
: Docstring is stale and references removed parameters
Docstring parameters (e.g., config_path, model_path, num_ctx_servers, server_port) no longer exist. Update to the current signature so downstream users don’t misconfigure.
-    """
-    Generate configuration YAML file for disaggregated inference.
-
-    Args:
-        config_path: Path to save the config file
-        model_path: Path to the model
-        num_ctx_servers: Number of context servers
-        ctx_tp_size: Tensor parallel size for context servers
-        ctx_batch_size: Batch size for context servers
-        ctx_max_num_tokens: Max number of tokens for context servers
-        ctx_max_seq_len: Max sequence length for context servers
-        ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers
-        ctx_enable_attention_dp: Enable attention DP for context servers
-        num_gen_servers: Number of generation servers
-        gen_tp_size: Tensor parallel size for generation servers
-        gen_batch_size: Batch size for generation servers
-        gen_max_num_tokens: Max number of tokens for generation servers
-        gen_enable_attention_dp: Enable attention DP for generation servers
-        gen_gpu_memory_fraction: GPU memory fraction for generation servers
-        eplb_num_slots: Number of slots for eplb
-        worker_start_port: Start port for workers
-        server_port: Server port
-    """
+    """
+    Generate ctx/gen YAML configs for disaggregated inference and write them to work_dir.
+
+    Args:
+        work_dir: Output directory for ctx_config.yaml and gen_config.yaml.
+        ctx_tp_size: TP size for context servers.
+        ctx_batch_size: Max batch size for context servers.
+        ctx_max_num_tokens: Max tokens for context servers.
+        ctx_max_seq_len: Max sequence length for context servers.
+        ctx_free_gpu_memory_fraction: KV cache free memory fraction (0..1) for context servers.
+        ctx_enable_attention_dp: Whether to enable Attention DP for context servers.
+        gen_tp_size: TP size for generation servers.
+        gen_batch_size: Max batch size for generation servers.
+        gen_max_num_tokens: Max tokens for generation servers.
+        gen_max_seq_len: Max sequence length for generation servers.
+        gen_enable_attention_dp: Whether to enable Attention DP for generation servers.
+        gen_gpu_memory_fraction: KV cache free memory fraction (0..1) for generation servers.
+        eplb_num_slots: MOE load-balancer slots; if > 0, writes moe_load_balancer.yaml and references it.
+        mtp_size: Number of layers for MTP speculative decoding; if > 0, adds speculative_config to both.
+        cache_transceiver_max_num_tokens: Max tokens in cache transceiver buffer (both roles).
+    """

examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)
153-155
: Stale reference to config.yaml (now split configs); remove or update
The pipeline now generates ctx_config.yaml/gen_config.yaml (and server_config.yaml later). Reading ${full_logdir}/config.yaml will fail. Unless another step writes this file, drop these lines or point to the correct server_config.yaml when it exists.
Suggested change:
-hostname_value=$(grep '^hostname:' ${full_logdir}/config.yaml | awk -F': ' '{print $2}')
-echo "server host name: $hostname_value"
+# Server hostname is derived from server_config.yaml created by start_server.sh/get_server_config.py.
🧹 Nitpick comments (14)
examples/disaggregated/slurm/benchmark/get_server_config.py (3)
1-7
: Add NVIDIA 2025 copyright header
Per repository guidelines, prepend the NVIDIA copyright header to Python sources.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import argparse
 import os
 import socket
 import time
47-47
: Fix long line flagged by Ruff (E501)
Break the long f-string into shorter parts.
-    print(
-        f"Waiting for hostnames to be found in {hostnames_folder}, current length: {len(hostnames)}, expected length: {args.num_ctx_servers + args.num_gen_servers}"
-    )
+    print(
+        "Waiting for hostnames to be found in "
+        f"{hostnames_folder}, current length: {len(hostnames)}, "
+        f"expected length: {args.num_ctx_servers + args.num_gen_servers}"
+    )
86-90
: Use safe_dump and explicit UTF-8 encoding for YAML output
Safer and yields stable formatting.
-    with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f:
-        yaml.dump(server_config, f)
+    out_path = os.path.join(args.work_dir, "server_config.yaml")
+    with open(out_path, "w", encoding="utf-8") as f:
+        yaml.safe_dump(server_config, f, default_flow_style=False, sort_keys=False)
-    print(
-        f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated"
-    )
+    print(f"Server config file {out_path} generated")

examples/disaggregated/slurm/benchmark/run_benchmark.sh (3)
35-46
: Quote paths when waiting for config file to avoid glob/space issues
Minor robustness tweak.
-while [ ! -f ${config_file} ]; do
+while [ ! -f "${config_file}" ]; do
49-56
: YAML parsing by grep is brittle; minimally constrain matches
If comments or similarly named keys are added, this can misparse. Consider yq or Python for robust parsing. As a minimal fix, anchor to start and ignore commented lines.
-hostname=$(grep -i "hostname:" ${config_file} | awk '{print $2}')
-port=$(grep -i "port:" ${config_file} | awk '{print $2}')
+hostname=$(grep -iE '^[[:space:]]*hostname:' "${config_file}" | grep -v '^[[:space:]]*#' | awk '{print $2}')
+port=$(grep -iE '^[[:space:]]*port:' "${config_file}" | grep -v '^[[:space:]]*#' | awk '{print $2}')

If you prefer a robust approach without extra dependencies, I can replace this with a tiny python -c snippet that safe_loads the YAML.
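For illustration, a minimal sketch of that Python-based approach (assumptions: PyYAML is available in the benchmark container, and the helper file name parse_server_config.py is hypothetical):

# parse_server_config.py - hypothetical helper that safe_loads the config and prints "hostname port"
import sys

import yaml

def main() -> None:
    config_path = sys.argv[1]  # e.g. the ${config_file} used by run_benchmark.sh
    with open(config_path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f) or {}
    # Print empty fields when keys are missing so the caller can detect the problem
    print(config.get("hostname", ""), config.get("port", ""))

if __name__ == "__main__":
    main()

The shell side could then consume it with something like: read hostname port < <(python3 parse_server_config.py "${config_file}").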
127-138
: Avoid kill -9 and limit scope to the job when possible
SIGKILL skips cleanup and the grep patterns may hit unrelated processes. Prefer TERM with a timeout; and restrict to the current user or SLURM job.
-kill -9 $(ps aux | grep '[s]tart_server.sh' | awk '{print $2}') >/dev/null 2>&1 || true
-kill -9 $(ps aux | grep '[s]tart_worker.sh' | awk '{print $2}') >/dev/null 2>&1 || true
-kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
-sleep 20 # Give processes some time to clean up
+pkill -TERM -u "${USER}" -f '[s]tart_server.sh' >/dev/null 2>&1 || true
+pkill -TERM -u "${USER}" -f '[s]tart_worker.sh' >/dev/null 2>&1 || true
+pkill -TERM -u "${USER}" -f '[t]rtllm-serve' >/dev/null 2>&1 || true
+sleep 20 # Give processes some time to clean up
+# Force kill only if still alive
+pgrep -u "${USER}" -f '[t]rtllm-serve' >/dev/null && pkill -KILL -u "${USER}" -f '[t]rtllm-serve' || true

examples/disaggregated/slurm/benchmark/gen_yaml.py (2)
1-4
: Add NVIDIA 2025 copyright header
Comply with the project’s source header requirement.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import argparse
 import os
132-139
: Ensure work_dir exists and use safe_dump with UTF-8
Create the directory, write YAML using safe_dump, and disable key sorting for readability.
-    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
-    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
-    with open(ctx_config_file, "w") as f:
-        yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False)
-    with open(gen_config_file, "w") as f:
-        yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False)
+    os.makedirs(work_dir, exist_ok=True)
+    ctx_config_file = os.path.join(work_dir, "ctx_config.yaml")
+    gen_config_file = os.path.join(work_dir, "gen_config.yaml")
+    with open(ctx_config_file, "w", encoding="utf-8") as f:
+        yaml.safe_dump(ctx_config, f, default_flow_style=False, sort_keys=False)
+    with open(gen_config_file, "w", encoding="utf-8") as f:
+        yaml.safe_dump(gen_config, f, default_flow_style=False, sort_keys=False)

examples/disaggregated/slurm/benchmark/start_worker.sh (2)
43-46
: Quote variable expansions when writing hostnames
Prevents word splitting and globbing issues; also avoid an unnecessary subshell.
-    mkdir -p ${work_dir}/hostnames/
-    echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt
-    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
+    mkdir -p "${work_dir}/hostnames"
+    hostname > "${work_dir}/hostnames/${role}_${instance_id}.txt"
+    echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt"
49-68
: NSYS prefix handling and quoting
ShellCheck warnings (SC2089/SC2090/SC2046) indicate quoting will be treated literally and word-splitting may occur. Use arrays for NSYS args and quote expansions elsewhere.
-if [ -z "${nsys_folder}" ]; then
+if [ -z "${nsys_folder}" ]; then
     echo "nsys is not enabled, start normal flow"
-    trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file}
+    trtllm-llmapi-launch trtllm-serve "${model_path}" --host "$(hostname)" --port "${port}" --extra_llm_api_options "${config_file}"
 else
     nsys_prefix=""
-    nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}
+    nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}"
     export TLLM_PROFILE_RECORD_GC=1
     export TLLM_NVTX_DEBUG=1
     if [ "${role}" = "GEN" ]; then
         export TLLM_PROFILE_START_STOP=200-250
-        nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none"
-        echo "nsys_prefix: ${nsys_prefix}"
+        # Build NSYS args as an array to preserve quoting:
+        nsys_args=(profile -e "NSYS_MPI_STORE_TEAMS_PER_RANK=1" -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none)
+        echo "nsys_file: ${nsys_file}"
     elif [ "${role}" = "CTX" ]; then
         echo "nsys is not enabled on ctx_gpus"
     fi
-    trtllm-llmapi-launch ${nsys_prefix} \
-        trtllm-serve ${model_path} \
-        --host $(hostname) --port ${port} \
-        --extra_llm_api_options ${config_file}
+    if [ "${role}" = "GEN" ]; then
+        trtllm-llmapi-launch nsys "${nsys_args[@]}" \
+            trtllm-serve "${model_path}" \
+            --host "$(hostname)" --port "${port}" \
+            --extra_llm_api_options "${config_file}"
+    else
+        trtllm-llmapi-launch trtllm-serve "${model_path}" \
+            --host "$(hostname)" --port "${port}" \
+            --extra_llm_api_options "${config_file}"
+    fi
 fi

examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
95-96
: Typo in log directory label ("tep" vs. "tp") will fragment result paths
The suffix should be consistent; use "tp" for tensor parallel size. Right now line 82 uses "dep" and this branch uses "tep". Pick one ("tp" is conventional) across both cases.
Apply this diff here, and mirror the same rename at Line 82 to keep the directory pattern stable:
-    full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
+    full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tp${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
158-159
: pid_list is collected but unused; add a trap to ensure cleanup of background sruns
Without cleanup, backgrounded sruns may linger if the job aborts. Add a trap to kill any remaining PIDs on EXIT/INT/TERM.
Inject right after pid_list initialization:
 pid_list=""
+cleanup() {
+    for pid in $pid_list; do
+        if kill -0 "$pid" >/dev/null 2>&1; then
+            echo "Cleaning up PID $pid"
+            kill "$pid" || true
+        fi
+    done
+}
+trap cleanup EXIT INT TERM
121-126
: Install step only runs in the base container; per-worker/server containers may miss TRT-LLM deps
You install with --container-name=${container_name} ("disaggr"), but workers/servers run in distinct containers (container_name_ctx_, container_name_gen_, container_name_server). Unless the image is pre-baked, those containers won’t see the editable install.
Options:
- Pre-bake the image with TRT-LLM and remove the on-cluster install.
- Or replicate the install for each container type before launching workers/servers, e.g.:
# Example: install in all upcoming container names (adjust counts accordingly)
for i in $(seq 0 $((num_ctx_servers - 1))); do
    srun --container-name=${container_name}_ctx_${i} --container-image=${container_image} \
        --container-mounts=${mounts} --mpi=pmix --overlap -N 1 -n 1 \
        bash -c "cd ${trtllm_repo} && pip install -e ."
done
for i in $(seq 0 $((num_gen_servers - 1))); do
    srun --container-name=${container_name}_gen_${i} --container-image=${container_image} \
        --container-mounts=${mounts} --mpi=pmix --overlap -N 1 -n 1 \
        bash -c "cd ${trtllm_repo} && pip install -e ."
done
srun --container-name=${container_name}_server --container-image=${container_image} \
    --container-mounts=${mounts} --mpi=pmix --overlap -N 1 -n 1 \
    bash -c "cd ${trtllm_repo} && pip install -e ."
128-149
: gen_yaml.py flags are correct; add file existence checks
I confirmed via rg that gen_yaml.py defines both --ctx_free_gpu_memory_fraction (line 166) and --gen_gpu_memory_fraction (line 194), matching the flags passed in the SLURM script, so no renaming is needed.
Next, to fail fast if the YAML artifacts aren’t generated, add a post-generation check. For example, immediately after the srun in examples/disaggregated/slurm/benchmark/disaggr_torch.slurm, insert:

# Verify that both config files were created
if [[ ! -f "${full_logdir}/ctx_config.yaml" || ! -f "${full_logdir}/gen_config.yaml" ]]; then
    echo "ERROR: Missing ctx/gen YAMLs in ${full_logdir}" >&2
    exit 1
fi

This ensures that any failure in gen_yaml.py is caught early rather than silently proceeding.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3 hunks)
examples/disaggregated/slurm/benchmark/gen_yaml.py (5 hunks)
examples/disaggregated/slurm/benchmark/get_server_config.py (1 hunks)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (3 hunks)
examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/disaggregated/slurm/benchmark/get_server_config.py
examples/disaggregated/slurm/benchmark/gen_yaml.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/disaggregated/slurm/benchmark/get_server_config.py
examples/disaggregated/slurm/benchmark/gen_yaml.py
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/get_server_config.py
47-47: Line too long (168 > 120)
(E501)
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh
[warning] 44-44: Quote this to prevent word splitting.
(SC2046)
[warning] 51-51: Quote this to prevent word splitting.
(SC2046)
[warning] 59-59: Quotes/backslashes will be treated literally. Use an array.
(SC2089)
[warning] 64-64: Quotes/backslashes in this variable will not be respected.
(SC2090)
[warning] 66-66: Quote this to prevent word splitting.
(SC2046)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
examples/disaggregated/slurm/benchmark/get_server_config.py (1)
68-85
: Confirm schema fields match trtllm-serve expectations
The keys 'backend', 'context_servers/generation_servers.num_instances', and 'urls' should match disaggregated server schema. Please confirm that:
- backend: 'pytorch' is valid here
- urls expect "host:port" without protocol
- num_instances must equal len(urls)
If the schema differs, the server may fail to load.
Would you like me to add a quick YAML schema check in CI to validate the generated file before server launch?
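If that would help, here is a rough sketch of what such a check could look like (assumptions: the file layout matches what get_server_config.py emits above, i.e. top-level context_servers/generation_servers with num_instances and urls; the actual trtllm-serve schema remains the source of truth):

# check_server_config.py - hypothetical pre-launch sanity check for server_config.yaml
import sys

import yaml

def check(path: str) -> int:
    with open(path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f) or {}
    errors = []
    for role in ("context_servers", "generation_servers"):
        section = cfg.get(role) or {}
        urls = section.get("urls") or []
        if section.get("num_instances") != len(urls):
            errors.append(f"{role}: num_instances does not match len(urls)")
        for url in urls:
            # urls are expected as "host:port" without a protocol prefix
            if "://" in url or ":" not in url:
                errors.append(f"{role}: malformed url {url!r}")
    for msg in errors:
        print(f"ERROR: {msg}", file=sys.stderr)
    return 1 if errors else 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))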
examples/disaggregated/slurm/benchmark/gen_yaml.py (2)
105-107
: Confirm 'MNNVL' allreduce strategy identifier
The string "MNNVL" may be a specific strategy code. Please verify the exact spelling and casing match the runtime’s expected values.
If needed, I can auto-scan for known allreduce strategies in the codebase with a follow-up script.
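As a sketch, that follow-up script could be as simple as the following (assumption: the allowed strategies appear as quoted upper-case literals on lines mentioning allreduce_strategy under tensorrt_llm/, which may not hold for every release):

# find_allreduce_strategies.py - hypothetical scan for candidate allreduce strategy names
import pathlib
import re

KEY = re.compile(r"allreduce_strategy", re.IGNORECASE)
QUOTED = re.compile(r"[\"']([A-Z_]{2,})[\"']")

def main() -> None:
    for path in pathlib.Path("tensorrt_llm").rglob("*.py"):
        for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
            if KEY.search(line):
                names = sorted(set(QUOTED.findall(line)))
                if names:
                    print(f"{path}: {names}")

if __name__ == "__main__":
    main()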
71-76
: All MOE backend strings are valid
I’ve confirmed that the three values used in examples/disaggregated/slurm/benchmark/gen_yaml.py – "CUTLASS", "WIDEEP", and "TRTLLM" – are part of the allowed set in the runtime’s configuration:
• In tensorrt_llm/llmapi/llm_args.py, the MoeConfig.backend field is defined as Literal["CUTLASS", "CUTEDSL", "WIDEEP", "TRTLLM", "DEEPGEMM", "VANILLA", "TRITON"], confirming that each of the three strings is accepted.
No changes are required in the snippet; the strings align exactly with the runtime’s enum.
Actionable comments posted: 0
♻️ Duplicate comments (5)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5)
146-146
: mtp_size now used in YAML generation and log key; earlier concern addressed
mtp_size is passed to gen_yaml.py and included in the logdir pattern, so the previous “unused” remark is resolved.
111-113
: nsys variable mismatch: define nsys_folder (workers consume it), not nsys_on
Workers are invoked with ${nsys_folder}, but the script only defines nsys_on; profiling path will be empty and silently misconfigured. Standardize on nsys_folder and (optionally) create the directory when enabled.
Apply:
-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; workers will write profiles here
+# [ -n "${nsys_folder}" ] && mkdir -p "${nsys_folder}"
158-170
: CTX worker launch: undefined vars, invalid --segment, and path/name mismatches
Problems:
- ctx_nodes_num and gpus_per_node are not defined anywhere.
- --segment is not a valid srun option.
- Uses ${work_dir} (undefined) instead of ${workdir}.
- Passes ${model_path} (undefined) instead of ${model_dir}.
- Depends on ${nsys_folder} which is currently not defined (see earlier comment).
Minimal safe fix (single node per CTX group; drop invalid/undefined options; fix var names):
 for i in $(seq 0 $((num_ctx_servers - 1))); do
-    srun -l -N ${ctx_nodes_num} \
-        --ntasks=${ctx_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${ctx_nodes_num} \
-        --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        bash ${work_dir}/start_worker.sh "CTX" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
-        &> ${full_logdir}/output_ctx_${i}.log &
+    srun -l -N 1 \
+        --ntasks=${ctx_tp_size} \
+        --container-image=${container_image} \
+        --container-name=${container_name} \
+        --container-mounts=${mounts} \
+        --mpi=pmix \
+        bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
+        &> ${full_logdir}/output_ctx_${i}.log &
     pid_list="${pid_list} $!"
 done

If you intend true multi-node per CTX group, define per-node GPU capacity and derive ctx_nodes_num before the loop:
# Place near other derived vars:
gpus_per_node="${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-1}}"
ctx_nodes_num=$(( (ctx_tp_size + gpus_per_node - 1) / gpus_per_node ))

Then use: -N "${ctx_nodes_num}" --ntasks="${ctx_tp_size}" --ntasks-per-node="${gpus_per_node}", and keep --segment removed.
Quick check:
#!/bin/bash
rg -n 'ctx_nodes_num|gpus_per_node|work_dir\b|model_path\b' examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
173-185
: GEN worker launch: mirror the CTX fixes; same undefined vars/--segment/path issues
- gen_nodes_num and gpus_per_node are not defined.
- --segment is invalid.
- ${model_path} is undefined; use ${model_dir}.
Minimal safe fix:
 for i in $(seq 0 $((num_gen_servers - 1))); do
-    srun -l -N ${gen_nodes_num} \
-        --ntasks=${gen_tp_size} \
-        --ntasks-per-node=${gpus_per_node} \
-        --segment=${gen_nodes_num} \
-        --container-image=${container_image} \
-        --container-name=${container_name} \
-        --container-mounts=${mounts} \
-        --mpi=pmix \
-        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
-        &> ${full_logdir}/output_gen_${i}.log &
+    srun -l -N 1 \
+        --ntasks=${gen_tp_size} \
+        --container-image=${container_image} \
+        --container-name=${container_name} \
+        --container-mounts=${mounts} \
+        --mpi=pmix \
+        bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
+        &> ${full_logdir}/output_gen_${i}.log &
     pid_list="${pid_list} $!"
 done

If multi-node GEN groups are desired, compute:
gpus_per_node="${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-1}}"
gen_nodes_num=$(( (gen_tp_size + gpus_per_node - 1) / gpus_per_node ))

Then use -N "${gen_nodes_num}" --ntasks="${gen_tp_size}" --ntasks-per-node="${gpus_per_node}" and omit --segment.
189-193
: Server start: work_dir vs workdir mismatch (undefined var will break server start)
The final srun passes ${work_dir}, but only ${workdir} is defined elsewhere. Fix the argument to pass the correct path. Also, consider not mixing --container-image and --container-name in the same step unless you intend to start a fresh container for the server step.
Mandatory fix:
 srun -l --container-name=${container_name} \
-    --container-image=${container_image} \
     --container-mounts=${mounts} \
     --mpi=pmix --overlap -N 1 -n 1 \
-    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${work_dir} \
+    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${workdir} \
     &> ${full_logdir}/output_server.log &

Optional cleanup: If you want to reuse the already named container, you can drop --container-image here to avoid starting a new container instance for the server step.
🧹 Nitpick comments (2)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)
1-9
: Make the script fail-fast and clean up reliably
Add strict mode and a trap to terminate background worker/server steps if any step fails; this prevents hung jobs and orphaned processes.
Apply near the top (after shebang):
 #!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'
+
+# Ensure background tasks are cleaned up on exit
+cleanup() {
+    srun -l --container-name=${container_name} --container-mounts=${mounts} --mpi=pmix --overlap \
+        kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
+}
+trap cleanup EXIT

Optionally add --kill-on-bad-exit=1 to critical srun steps (YAML gen, workers, server, benchmark) so failures propagate.
128-150
: Optional: Unify GPU memory-fraction flag names for clarity
I verified that in examples/disaggregated/slurm/benchmark/gen_yaml.py you’ve defined:
--ctx_free_gpu_memory_fraction (default 0.75)
--gen_gpu_memory_fraction (default 0.8)
and your SLURM script correctly invokes those exact flags. However, the asymmetry—one flag containing “free” and the other not—can be confusing. Consider a one-time refactor to align them:
• Option A: Rename the context flag to --ctx_gpu_memory_fraction
• Option B: Rename the generation flag to --gen_free_gpu_memory_fraction
Then update both the add_argument calls in gen_yaml.py and the corresponding --ctx_* / --gen_* invocations in disaggr_torch.slurm so they match.
📒 Files selected for processing (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3 hunks)
🔇 Additional comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)
79-79
: Container name standardization LGTM
Renaming to disaggregated_serving improves clarity and aligns with the split server/worker flow. No action needed.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)
1-1
: Add strict bash options to fail fast and preserve upstream exit codes
Pipelines like "python3 … | tee …" will otherwise hide failures because tee’s status wins. Enable strict mode right after the shebang.
 #!/bin/bash
+set -Eeuo pipefail
♻️ Duplicate comments (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6)
111-113
: nsys variable mismatch — standardize on nsys_folder (path), not nsys_on (flag)
The worker launch passes a path-like last argument; keep the name semantic. Align the variable and comment accordingly. This also addresses a prior review.
-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment this line to enable Nsys profiling; profiles will be written here
128-150
: YAML generation: good structure; verify arg names and capture errors
- The call is clear and well-logged. Two asks:
- Confirm gen_yaml.py expects --ctx_free_gpu_memory_fraction (ctx) but --gen_gpu_memory_fraction (gen). If both sides expect “free” or “gpu”, make the naming consistent.
- With set -o pipefail added, python failures will now propagate despite tee, which is what we want.
Also, answering an earlier question: mtp_size is used both in the logdir naming and here as --mtp_size, so it is not dead.
177-193
: GEN worker srun: invalid flag, and undefined/mismatched variables in command
- --segment is not a valid srun option.
- work_dir/model_path/nsys_on are undefined or mismatched with earlier variables (workdir/model_dir/nsys_folder).
- Specifying both --ntasks and --ntasks-per-node can overconstrain placement when gen_tp_size is not a multiple of gpus_per_node.
Minimal, robust fix:
 srun -l -N ${gen_nodes_num} \
-    --ntasks=${gen_tp_size} \
-    --ntasks-per-node=${gpus_per_node} \
-    --segment=${gen_nodes_num} \
+    --ntasks=${gen_tp_size} \
     --container-image=${container_image} \
     --container-name=${container_name} \
     --container-mounts=${mounts} \
     --nodelist=$(IFS=,; echo "${node_list[*]}") \
     --mpi=pmix \
-    bash ${work_dir}/start_worker.sh "GEN" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+    bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
     &> ${full_logdir}/output_gen_${i}.log &

Optionally, if you must constrain per-node tasks, compute a remainder-aware layout and launch per-node sruns; otherwise, rely on Slurm to distribute -n across -N.
195-213
: CTX worker srun: same issues as GEN block
Mirror the GEN fixes: drop --segment, align variable names, and avoid overconstraining placement unless you ensure divisibility.
 srun -l -N ${ctx_nodes_num} \
-    --ntasks=${ctx_tp_size} \
-    --ntasks-per-node=${gpus_per_node} \
-    --segment=${ctx_nodes_num} \
+    --ntasks=${ctx_tp_size} \
     --container-image=${container_image} \
     --container-name=${container_name} \
     --container-mounts=${mounts} \
     --nodelist=$(IFS=,; echo "${node_list[*]}") \
     --mpi=pmix \
-    bash ${work_dir}/start_worker.sh "CTX" ${i} ${model_path} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+    bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
     &> ${full_logdir}/output_ctx_${i}.log &
218-223
: Server start: work_dir vs workdir mismatch breaks argv to start_server.sh
The last argument is ${work_dir}, which is undefined. Use ${workdir} to keep consistency. This was flagged previously as well.
 srun -l --container-name=${container_name} \
     --container-image=${container_image} \
     --container-mounts=${mounts} \
     --mpi=pmix --overlap -N 1 -n 1 \
-    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${work_dir} \
+    bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${workdir} \
     &> ${full_logdir}/output_server.log &
154-156
: gpus_per_node is undefined; node count math will fail
You compute ctx_nodes_num/gen_nodes_num using gpus_per_node, but gpus_per_node is not set anywhere. Parse it from Slurm env (SLURM_TASKS_PER_NODE/SLURM_NTASKS_PER_NODE) with sane fallbacks.
-ctx_nodes_num=$(((ctx_tp_size + gpus_per_node - 1) / gpus_per_node))
-gen_nodes_num=$(((gen_tp_size + gpus_per_node - 1) / gpus_per_node))
+#
+# Derive tasks-per-node (used here as GPUs-per-node) from Slurm env.
+# Handles forms like "4(x2),3" by taking the first numeric group.
+gpus_per_node="${SLURM_TASKS_PER_NODE:-${SLURM_NTASKS_PER_NODE:-}}"
+gpus_per_node="${gpus_per_node%%,*}"  # take first segment
+gpus_per_node="${gpus_per_node%%(*}"  # strip (xN)
+if ! [[ "${gpus_per_node}" =~ ^[0-9]+$ ]]; then gpus_per_node=1; fi
+echo "gpus_per_node: ${gpus_per_node}"
+
+ctx_nodes_num=$(((ctx_tp_size + gpus_per_node - 1) / gpus_per_node))
+gen_nodes_num=$(((gen_tp_size + gpus_per_node - 1) / gpus_per_node))
🧹 Nitpick comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
79-79
: Container name reuse across steps — verify plugin semantics
Reusing a single container_name for GEN/CTX/server steps is fine if your Slurm container plugin (e.g., pyxis/enroot) guarantees safe reuse and isolation per job step. If reuse is not guaranteed, consider per-role names (e.g., disaggregated_serving_{gen,ctx,server}) to avoid cross-step side effects.
173-176
: Leftover hostnames cleanup without producer
You remove ${full_logdir}/hostnames but the script never generates it anymore. Safe to keep, but consider dropping the cleanup to avoid confusion.
214-214
: pid_list is accumulated but never used
If you don’t intend to wait on or manage the background pids explicitly, drop the echo and the accumulation to reduce noise. Alternatively, add a final wait on those to ensure orderly shutdown.
82-82
: Minor: standardize logdir token for tensor parallel size
You use "dep" in one path and "tep" in the other; most places use "tp". Standardize to "tp" for grep-ability.
Example (not an exact diff here as these lines weren’t changed in this patch):
- ctx…_gen…_dep${gen_tp_size}… → ctx…_gen…_tp${gen_tp_size}…
- ctx…_gen…_tep${gen_tp_size}… → ctx…_gen…_tp${gen_tp_size}…
Also applies to: 95-95
📒 Files selected for processing (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Actionable comments posted: 0
♻️ Duplicate comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
113-115
: Fix Nsys variable mismatch; standardize on nsys_folder and pass it to workers.
Workers should receive a folder path; current var name (nsys_on) is ambiguous and inconsistent with earlier guidance.
-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; profiles will be saved here

-    bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+    bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

-    bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+    bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

Also applies to: 192-193, 211-213
156-171
: Bound ctx_nodes slice and add capacity guard to prevent over-requesting nodes.
Current slicing takes “the rest” of all nodes; also no check against allocation size.
 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"

 # get the node list for the gen workers
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"

 # get the node list for the ctx workers
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
183-193
: Remove invalid srun option --segment in GEN worker launch.
--segment is not a valid srun flag; leaving it will cause runtime failures.
 srun -l -N ${gen_nodes_num} \
     --ntasks=${gen_tp_size} \
-    --ntasks-per-node=${gpus_per_node} \
-    --segment=${gen_nodes_num} \
+    --ntasks-per-node=${gpus_per_node} \
     --container-image=${container_image} \
     --container-name=${container_name} \
     --container-mounts=${mounts} \
     --nodelist=$(IFS=,; echo "${node_list[*]}") \
     --mpi=pmix \
202-213
: Remove invalid srun option --segment in CTX worker launch.
Same issue as GEN block; this will fail at runtime.
 srun -l -N ${ctx_nodes_num} \
     --ntasks=${ctx_tp_size} \
-    --ntasks-per-node=${gpus_per_node} \
-    --segment=${ctx_nodes_num} \
+    --ntasks-per-node=${gpus_per_node} \
     --container-image=${container_image} \
     --container-name=${container_name} \
     --container-mounts=${mounts} \
     --nodelist=$(IFS=,; echo "${node_list[*]}") \
     --mpi=pmix \
🧹 Nitpick comments (5)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5)
45-45
: Don’t hardcode gpus_per_node; derive from SLURM env (with sane fallback).
Hardcoding 4 risks mismatch with the actual allocation and leads to wrong node math.
-gpus_per_node=4
+# Derive GPUs/tasks per node from SLURM env (overrideable), default to 4 if unknown.
+gpus_per_node="${GPUS_PER_NODE_OVERRIDE:-${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-4}}}"
+echo "gpus_per_node: ${gpus_per_node}"

Run-time check (optional): validate that ctx/gen nodes computed below match $SLURM_NNODES and fail early if not (see capacity guard added in a later comment).
130-152
: Pipeline safety: tee masks failures — enable pipefail and fail-fast.
Without pipefail, a failing python in the YAML step won’t stop the job.
 #!/bin/bash
+set -euo pipefail
(With this in place, the gen_yaml.py | tee step will correctly propagate non-zero exit codes.)
Also applies to: 1-1
183-193
: Optional: avoid over-constraining distribution; let Slurm place tasks.
If tp_size isn’t a multiple of gpus_per_node, --ntasks-per-node=${gpus_per_node} can over-constrain placement. It’s safer to omit it and rely on --ntasks with --nodelist.
-    --ntasks-per-node=${gpus_per_node} \
If you need 1 task per GPU, prefer --gpus-per-task=1 (if GPUs are scheduled) and drop --ntasks-per-node.
116-122
: Container warm-up step is redundant or too narrow.This creates a container on a single node only; subsequent steps will (re)create containers where needed. Either remove it or scale it to all nodes.
Option A (remove):
-srun -l --container-image=${container_image} \ - --container-name=${container_name} \ - --container-mounts=${mounts} \ - --mpi=pmix \ - echo "Container up."Option B (warm up on all nodes):
-srun -l --container-image=${container_image} \
-    --container-name=${container_name} \
-    --container-mounts=${mounts} \
-    --mpi=pmix \
-    echo "Container up."

Option B (warm up on all nodes):
232-237
: Kill step likely runs on one node only; scale it across all nodes.Ensure the cleanup runs everywhere to avoid orphaned servers/workers.
-srun -l --container-name=${container_name} \ - --container-mounts=${mounts} \ - --mpi=pmix --overlap \ - kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true +srun -l --container-name=${container_name} \ + --container-mounts=${mounts} \ + --mpi=pmix --overlap -N "${SLURM_NNODES}" --ntasks-per-node=1 \ + bash -lc "pkill -9 -f '[t]rtllm-serve' || true"
📒 Files selected for processing (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)
218-225
: LGTM: server start uses consistent workdir and container options.
The earlier work_dir/workdir mismatch appears resolved, and adding container-image/mounts for the server step is correct.
2-4
: Verify total resources requested by steps don’t exceed SBATCH allocation.
Given SBATCH has 2 nodes and 8 tasks, confirm that:
- total_gen_nodes_num + total_ctx_nodes_num <= $SLURM_NNODES
- Sum of concurrent worker step tasks does not exceed available CPUs/GPUs per node
If needed, add a guard before launching workers:
# After computing gen_nodes_num/ctx_nodes_num and slices:
if (( total_gen_nodes_num + total_ctx_nodes_num != SLURM_NNODES )); then
    echo "Warning: Node slices (${total_gen_nodes_num}+${total_ctx_nodes_num}) differ from allocated nodes (${SLURM_NNODES})."
fi

Also applies to: 156-171, 177-215
Actionable comments posted: 10
♻️ Duplicate comments (2)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)
115-117
: Prefer an explicit nsys_folder and precreate it; current name suggests boolean
The variable name nsys_on reads as a boolean toggle but is passed as a path to workers. Use nsys_folder for clarity and create the directory when enabled. This aligns with earlier guidance.
-nsys_on=""
-# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling
+nsys_folder=""
+# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling
+if [[ -n "${nsys_folder}" ]]; then
+    mkdir -p "${nsys_folder}"
+fi

And pass nsys_folder to workers:
-    bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+    bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \

-    bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \
+    bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
158-174
: Node slicing: add capacity guard and bound CTX slice to requested size
Currently ctx_nodes is assigned the remainder of all_nodes; if allocation > required, CTX will grab extra nodes; if allocation < required, you won’t fail early. Add a capacity check and slice exactly total_ctx_nodes_num.
 all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort))
 total_nodes_num=${#all_nodes[@]}
 echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}"
@@
 total_gen_nodes_num=$((gen_nodes_num * num_gen_servers))
 gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}})
 echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}"
@@
 total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers))
-ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}})
+if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then
+    echo "Error: requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2
+    exit 1
+fi
+ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}})
 echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
🧹 Nitpick comments (17)
examples/disaggregated/slurm/benchmark/gen_server_config.py (4)
32-35
: Optional: create work_dir if missing instead of raising
Creating the directory (with exist_ok=True) reduces flakiness in orchestrations where the server side may race with the producer of the folder.
-    if not os.path.exists(args.work_dir):
-        raise ValueError(f"Work directory {args.work_dir} not found")
+    if not os.path.exists(args.work_dir):
+        os.makedirs(args.work_dir, exist_ok=True)
+        logger.info("Created work directory %s", args.work_dir)
65-66
: Prefer logging over prints for host lists
Move these to INFO so they’re visible in SLURM output but can be silenced via --log_level.
-    print(f"ctx_hostnames: {ctx_hostnames}")
-    print(f"gen_hostnames: {gen_hostnames}")
+    logger.info("ctx_hostnames: %s", ctx_hostnames)
+    logger.info("gen_hostnames: %s", gen_hostnames)
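A minimal sketch of how the script could wire that up (the --log_level flag and logger name are assumptions here, not existing code in gen_server_config.py):

# Hypothetical logging setup for gen_server_config.py
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--log_level", default="INFO",
                    help="Logging level, e.g. DEBUG, INFO, WARNING")
args, _ = parser.parse_known_args()

logging.basicConfig(level=getattr(logging, args.log_level.upper(), logging.INFO),
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("gen_server_config")

logger.info("ctx_hostnames: %s", ["ctx-node-0", "ctx-node-1"])  # placeholder values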
68-71
: Nit: comment correctness and structured logging
You get the hostname from the system, not the env. Also prefer logging over print for consistency.
-    # get current hostname from env
+    # Get current hostname from the system
     hostname = socket.gethostname()
-    print(f"Current hostname: {hostname}")
+    logger.info("Current hostname: %s", hostname)
86-90
: Use safe YAML dump, explicit encoding, and a single log line
Safer dumper, stable key order, and consistent logging.
-    with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f:
-        yaml.dump(server_config, f)
-    print(
-        f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated"
-    )
+    out_path = os.path.join(args.work_dir, "server_config.yaml")
+    with open(out_path, "w", encoding="utf-8") as f:
+        yaml.safe_dump(server_config, f, sort_keys=False)
+    logger.info("Server config file %s generated", out_path)

examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5)
95-100
: Echo line reports only gen fraction; include ctx to avoid confusion
The log prints enable_attention_dp for both CTX/GEN but only reports gen_gpu_memory_fraction. Consider echoing ctx_free_gpu_memory_fraction as well for symmetry.
-echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"
+echo "enable_attention_dp: CTX=${ctx_enable_attention_dp}, GEN=${gen_enable_attention_dp}, "\
+"gpu_memory_fraction: CTX=${ctx_gpu_memory_fraction}, GEN=${gen_gpu_memory_fraction}"
132-155
: Fail fast and surface config generation errors
The gen_worker_config.py stage is critical. Without set -euo pipefail, downstream SRUNs may proceed with partial/missing config. Add strict mode and a trap early; also check that expected artifacts are present.
 #!/bin/bash
+# Fail fast and surface line numbers on error
+set -euo pipefail
+trap 'echo "[ERROR] $(basename "$0") failed at line ${LINENO}" >&2' ERR
@@
 srun -l -N 1 -n 1 \
@@
     2>&1 | tee ${full_logdir}/gen_worker_config.log
+
+# Optional: verify expected outputs exist before continuing
+if [[ ! -s "${full_logdir}/gen_config/ctx_config.yaml" || ! -s "${full_logdir}/gen_config/gen_config.yaml" ]]; then
+    echo "Missing worker configs under ${full_logdir}/gen_config. See gen_worker_config.log." >&2
+    exit 1
+fi
229-233
: Optional: isolate benchmark resource use
The benchmark srun uses the shared container and may contend with servers for CPU cores. Consider pinning CPUs or using --hint/--cpus-per-task if your cluster enforces CPU binding to avoid noisy-neighbor effects in results.
Example:
-    --mpi=pmix --overlap -N 1 -n 1 \
+    --mpi=pmix --overlap -N 1 -n 1 --cpus-per-task=4 \
234-240
: Shutdown sequence OK; consider a gentler stop before SIGKILL
kill -9 is effective but skips cleanup. If start_server.sh or workers trap SIGTERM, a brief graceful phase can improve log flush and cleanup.
-    kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
+    pids=$(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') || true
+    [[ -n "${pids}" ]] && kill ${pids} >/dev/null 2>&1 || true
+    sleep 2
+    [[ -n "${pids}" ]] && kill -9 ${pids} >/dev/null 2>&1 || true
45-47
: Optional Refactor: Robust SLURM tasks-per-node parsing
- In a non-Slurm environment both SLURM_NTASKS_PER_NODE and SLURM_TASKS_PER_NODE are unset (empty), so the proposed snippet correctly falls back to the default of 4.
- The sed -E 's/^([0-9]+).*/\1/' will extract “4” from strings like 4(x2),2, covering heterogeneous allocations.
- Applying this diff will make ntasks_per_node derivation resilient across Slurm formats while preserving your existing default:
-# Get GPUs per node dynamically from SLURM
-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4} # Default to 4 for GB200
+# Get tasks-per-node (used as "gpus per node") robustly; fallback to 4
+ntasks_per_node="${SLURM_NTASKS_PER_NODE:-}"
+if [[ -z "${ntasks_per_node}" ]]; then
+    # SLURM_TASKS_PER_NODE can look like "4(x2),2"; take the first integer
+    ntasks_per_node="$(echo "${SLURM_TASKS_PER_NODE:-}" | sed -E 's/^([0-9]+).*/\1/')" || true
+fi
+ntasks_per_node="${ntasks_per_node:-4}"
: Deduplicate and normalize CUDA graph batch sizes.If gen_batch_size equals one of the preset sizes (e.g., 256), the list contains duplicates. Some readers validate uniqueness.
Apply:
- gen_cuda_graph_batch_sizes = [ - 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size - ] + gen_cuda_graph_batch_sizes = sorted({ + 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size + })
71-76
: Streamline MOE backend selection to mutually exclusive branches.Current code is correct but non-idiomatic. Using if/elif/else clarifies precedence.
- gen_moe_backend = "CUTLASS" - if gen_tp_size >= 16 and gen_enable_attention_dp: - gen_moe_backend = "WIDEEP" - if not gen_enable_attention_dp: - gen_moe_backend = "TRTLLM" + if not gen_enable_attention_dp: + gen_moe_backend = "TRTLLM" + elif gen_tp_size >= 16: + gen_moe_backend = "WIDEEP" + else: + gen_moe_backend = "CUTLASS"
52-52
: Redundant bool conversions.
True if <bool> else False
is verbose; preferbool(...)
.- 'enable_attention_dp': True if ctx_enable_attention_dp else False, + 'enable_attention_dp': bool(ctx_enable_attention_dp),- 'enable_attention_dp': True if gen_enable_attention_dp else False, + 'enable_attention_dp': bool(gen_enable_attention_dp),Also applies to: 80-80
57-60
: Validate kv_cache_config dtype 'fp8' and consider parameterizing.Ensure 'fp8' is an accepted dtype for KV cache in current TRT-LLM. If not universally supported, consider exposing as a CLI arg with a safe default.
If parameterizing:
- 'dtype': 'fp8', + 'dtype': os.environ.get('TRTLLM_KV_DTYPE', 'fp8'),Or add a
--kv_cache_dtype
CLI argument used in both configs.Also applies to: 90-94
132-141
: Avoid shadowing the function name with a local variable and improve naming.Inside the function,
gen_config_file
(str path) shadows the function identifier. Rename to *_path for clarity.- ctx_config_file = os.path.join(work_dir, "ctx_config.yaml") - gen_config_file = os.path.join(work_dir, "gen_config.yaml") - with open(ctx_config_file, "w") as f: + ctx_config_path = os.path.join(work_dir, "ctx_config.yaml") + gen_config_path = os.path.join(work_dir, "gen_config.yaml") + with open(ctx_config_path, "w") as f: yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False) - with open(gen_config_file, "w") as f: + with open(gen_config_path, "w") as f: yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False) - print( - f"ctx_config_file: {ctx_config_file} gen_config_file: {gen_config_file} generated successfully" - ) + print(f"ctx_config_file: {ctx_config_path} gen_config_file: {gen_config_path} generated successfully")
147-211
: Add basic CLI validation for positive ints and (0,1] fractions.Guardrail invalid inputs early to prevent obscure runtime behavior.
Minimal change: custom argparse type validators and reuse for relevant args.
import argparse import os import yaml +def _fraction01(value: str) -> float: + try: + x = float(value) + except ValueError as e: + raise argparse.ArgumentTypeError(f"Invalid float: {value}") from e + if not (0.0 < x <= 1.0): + raise argparse.ArgumentTypeError(f"Value must be in (0, 1], got {x}") + return x + +def _pos_int(value: str) -> int: + try: + x = int(value) + except ValueError as e: + raise argparse.ArgumentTypeError(f"Invalid int: {value}") from e + if x <= 0: + raise argparse.ArgumentTypeError(f"Value must be > 0, got {x}") + return x + parser = argparse.ArgumentParser() @@ - parser.add_argument("--ctx_batch_size", - type=int, + parser.add_argument("--ctx_batch_size", + type=_pos_int, default=1, help="Batch size for context servers") @@ - parser.add_argument("--ctx_max_num_tokens", - type=int, + parser.add_argument("--ctx_max_num_tokens", + type=_pos_int, default=8192, help="Max number of tokens for context servers") @@ - parser.add_argument("--ctx_max_seq_len", - type=int, + parser.add_argument("--ctx_max_seq_len", + type=_pos_int, default=8192, help="Max sequence length for context servers") @@ - parser.add_argument("--ctx_free_gpu_memory_fraction", - type=float, + parser.add_argument("--ctx_free_gpu_memory_fraction", + type=_fraction01, default=0.75, help="Free GPU memory fraction for context servers") @@ - parser.add_argument("--gen_batch_size", - type=int, + parser.add_argument("--gen_batch_size", + type=_pos_int, default=256, help="Batch size for generation servers") @@ - parser.add_argument("--gen_max_num_tokens", - type=int, + parser.add_argument("--gen_max_num_tokens", + type=_pos_int, default=256, help="Max number of tokens for generation servers") @@ - parser.add_argument("--gen_max_seq_len", - type=int, + parser.add_argument("--gen_max_seq_len", + type=_pos_int, default=9216, help="Max sequence length for generation servers") @@ - parser.add_argument("--gen_gpu_memory_fraction", - type=float, + parser.add_argument("--gen_gpu_memory_fraction", + type=_fraction01, default=0.8, help="GPU memory fraction for generation servers") @@ - parser.add_argument("--eplb_num_slots", - type=int, + parser.add_argument("--eplb_num_slots", + type=int, default=0, help="Number of slots for eplb") @@ - parser.add_argument("--mtp_size", - type=int, + parser.add_argument("--mtp_size", + type=int, default=0, help="Number of nextn layers for MTP") @@ - parser.add_argument("--cache_transceiver_max_num_tokens", - type=int, + parser.add_argument("--cache_transceiver_max_num_tokens", + type=_pos_int, default=8448, help="Max number of tokens for cache transceiver")
135-137
: Prefer yaml.safe_dump for emitting plain data. Safer default and consistent with best practices when not relying on Python object tags.
- yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False) + yaml.safe_dump(ctx_config, f, default_flow_style=False, sort_keys=False) @@ - yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False) + yaml.safe_dump(gen_config, f, default_flow_style=False, sort_keys=False)
139-141
: Consider returning output paths instead of printing. Returning `(ctx_config_path, gen_config_path)` makes this utility easier to test and compose; the CLI can still print them. I can wire this through and update start_worker.sh accordingly if you’d like.
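If you want to go that way, a minimal sketch is below. The function and variable names, the example config dicts, and the `/tmp` path are illustrative only, not the PR's actual signature:

```python
import os
from typing import Tuple

import yaml


def write_worker_configs(work_dir: str, ctx_config: dict, gen_config: dict) -> Tuple[str, str]:
    """Write ctx/gen worker YAMLs and return their paths instead of only printing them."""
    os.makedirs(work_dir, exist_ok=True)
    ctx_config_path = os.path.join(work_dir, "ctx_config.yaml")
    gen_config_path = os.path.join(work_dir, "gen_config.yaml")
    with open(ctx_config_path, "w") as f:
        yaml.safe_dump(ctx_config, f, default_flow_style=False, sort_keys=False)
    with open(gen_config_path, "w") as f:
        yaml.safe_dump(gen_config, f, default_flow_style=False, sort_keys=False)
    return ctx_config_path, gen_config_path


if __name__ == "__main__":
    # The CLI can still print for the SLURM logs while tests/callers use the return value.
    ctx_path, gen_path = write_worker_configs(
        "/tmp/demo_work_dir", {"max_batch_size": 1}, {"max_batch_size": 256})
    print(f"ctx_config_file: {ctx_path} gen_config_file: {gen_path} generated successfully")
```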
📒 Files selected for processing (4)
- examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (5 hunks)
- examples/disaggregated/slurm/benchmark/gen_server_config.py (1 hunks)
- examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
- examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/disaggregated/slurm/benchmark/start_server.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/disaggregated/slurm/benchmark/gen_server_config.py
examples/disaggregated/slurm/benchmark/gen_worker_config.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/disaggregated/slurm/benchmark/gen_server_config.py
examples/disaggregated/slurm/benchmark/gen_worker_config.py
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/gen_server_config.py
47-47: Line too long (168 > 120)
(E501)
🔇 Additional comments (6)
examples/disaggregated/slurm/benchmark/gen_server_config.py (2)
72-84
: Server config structure looks good. The schema is straightforward and maps cleanly to the disaggregated setup (separate CTX/GEN groups with per-host URLs). With the consistency checks above, this should be robust.
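For completeness, a small standalone check of an emitted server_config.yaml could look like the sketch below. It only assumes the schema discussed in this review (top-level hostname/port plus context_servers/generation_servers blocks carrying num_instances and urls); the script name and default path are hypothetical:

```python
import sys

import yaml


def check_server_config(path: str) -> None:
    """Fail loudly if num_instances disagrees with the number of URLs in either group."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for group in ("context_servers", "generation_servers"):
        urls = cfg[group]["urls"]
        expected = cfg[group]["num_instances"]
        if len(urls) != expected:
            raise ValueError(f"{group}: num_instances={expected} but {len(urls)} urls: {urls}")
    print(f"{path}: hostname={cfg['hostname']} port={cfg['port']} looks consistent")


if __name__ == "__main__":
    # Usage: python check_server_config.py <path/to/server_config.yaml>
    check_server_config(sys.argv[1] if len(sys.argv) > 1 else "server_config.yaml")
```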
72-84
: Ensure `num_instances` always matches the URL lists
Derive the instance counts directly from the discovered hostnames to prevent any mismatch if the CLI‐provided number and the actual host list ever diverge. Please verify that downstream consumers read both `num_instances` and the `urls` list consistently.
'context_servers': {
-    'num_instances': args.num_ctx_servers,
+    'num_instances': len(ctx_hostnames),
     'urls': [f'{host}:{args.worker_port}' for host in ctx_hostnames]
},
'generation_servers': {
-    'num_instances': args.num_gen_servers,
+    'num_instances': len(gen_hostnames),
     'urls': [f'{host}:{args.worker_port}' for host in gen_hostnames]
}
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)
83-86
: LGTM: container identity and log layout. The container_name change and the new per-job log directory structure look good and improve discoverability.
221-227
: Server launch looks consistent now. workdir is passed (not work_dir), which matches the rest of the script, and the container reuse is coherent. No issues spotted.
examples/disaggregated/slurm/benchmark/gen_worker_config.py (2)
105-107
: No action required: “MNNVL” is a valid allreduce_strategy
We’ve confirmed that downstream consumers recognize the "MNNVL" value in multiple places:
- In the LLM API arguments literal type: tensorrt_llm/llmapi/llm_args.py includes 'MNNVL' among valid strategies.
- In the Python enum: tensorrt_llm/functional.py defines MNNVL = 7.
- In the model config mapping: tensorrt_llm/_torch/model_config.py maps the string "MNNVL" to AllReduceStrategy.MNNVL.
- In the C++ kernels: cpp/tensorrt_llm/kernels/customAllReduceKernels.h lists MNNVL as an option, and related mnnvl_twoshot_allreduce kernels are registered.
- In tests and documentation: multiple unit tests parametrize on AllReduceStrategy.MNNVL, and docs under docs/source/advanced/ and release notes list "MNNVL" as a supported strategy.

Since "MNNVL" is spelled correctly and fully supported, no changes are needed here.
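If extra safety is wanted in the generator itself, a tiny guard like the sketch below could catch typos before the YAML reaches trtllm-serve. The accepted-name set shown is only an assumed subset; the authoritative list is the literal type in tensorrt_llm/llmapi/llm_args.py noted above:

```python
# Hypothetical helper, not part of the PR: reject unknown allreduce_strategy
# strings before they are written into ctx_config.yaml / gen_config.yaml.
KNOWN_ALLREDUCE_STRATEGIES = {"AUTO", "NCCL", "MNNVL"}  # assumed subset; see llm_args.py for the full set


def check_allreduce_strategy(name: str) -> str:
    if name not in KNOWN_ALLREDUCE_STRATEGIES:
        raise ValueError(
            f"Unsupported allreduce_strategy {name!r}; "
            f"expected one of {sorted(KNOWN_ALLREDUCE_STRATEGIES)}")
    return name


print(check_allreduce_strategy("MNNVL"))  # -> MNNVL
```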
108-121
: Confirm file-path usage is correct
The validate_load_balancer method in tensorrt_llm/llmapi/llm_args.py (around lines 2243–2252) explicitly checks for a string, treats it as a path to a YAML file, loads it with yaml.safe_load, and constructs a MoeLoadBalancerConfig instance. Downstream, moe_config.load_balancer is passed as that object into the engine config, so writing out a separate YAML and assigning its path is exactly the intended pattern. No changes required.
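A minimal sketch of that write-then-reference pattern is below, assuming the moe_load_balancer.yaml file name mentioned elsewhere in this review and an illustrative num_slots key (the real schema is whatever MoeLoadBalancerConfig accepts):

```python
import os

import yaml


def write_moe_load_balancer_config(work_dir: str, eplb_num_slots: int) -> dict:
    """Return a moe_config fragment, writing the slots YAML only when eplb is enabled."""
    moe_config: dict = {}
    if eplb_num_slots > 0:
        slots_path = os.path.join(work_dir, "moe_load_balancer.yaml")
        with open(slots_path, "w") as f:
            # Key is illustrative; MoeLoadBalancerConfig defines the real schema.
            yaml.safe_dump({"num_slots": eplb_num_slots}, f)
        # Assign the *path*: validate_load_balancer() loads it with yaml.safe_load
        # and builds a MoeLoadBalancerConfig from the contents.
        moe_config["load_balancer"] = slots_path
    return {"moe_config": moe_config}


if __name__ == "__main__":
    os.makedirs("/tmp/eplb_demo", exist_ok=True)
    print(write_moe_load_balancer_config("/tmp/eplb_demo", eplb_num_slots=4))
```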
Actionable comments posted: 1
🧹 Nitpick comments (6)
examples/disaggregated/slurm/benchmark/README.md (6)
14-14
: Minor grammar and article usage. Add an article before “disaggregated server”.
-6. `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine.
+6. `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.
If you adopt the broader clarification in my other comment for Lines 10–16, this change will be superseded.
73-82
: Arg list is clear; consider adding types, defaults, and constraints. Small doc enhancement: specify accepted values (true|false), integer ranges, defaults, and whether work_dir must be shared across nodes.
Proposed tweaks:
-1. `worker_type`: Either "CTX" or "GEN" to specify the worker type. +1. `worker_type` (string): "CTX" or "GEN". -2. `worker_index`: Index of the worker instance. +2. `worker_index` (int ≥ 0): Index of the worker instance. -4. `worker_port`: Port for the worker to listen on. +4. `worker_port` (int, 1–65535): Port for the worker to listen on. -7. `enable_pdl`: `true` or `false`. +7. `enable_pdl` (bool): `true` or `false`. Default: `false` (if applicable). -8. `work_dir`: Work directory for logs and configuration. +8. `work_dir` (path): Work directory for logs and configuration (must be accessible to the node running this worker). -9. `nsys_on`: Whether to enable nsys profiling. +9. `nsys_on` (bool): Whether to enable `nsys` profiling. Default: `false` (if applicable).Please align these with actual script defaults. If helpful, I can extract them and update the text.
83-92
: Server section: add where logs/config are written and any ports/URLs to expect. A brief note about the server’s log path, the default bind address/port, and the exact config filename will help users automate health checks and log collection.
Possible augmentation:
-This script starts the `trtllm-serve disaggregated` server. It first generates the server configuration using `gen_server_config.py`, then starts the server process. +This script starts the `trtllm-serve disaggregated` server. It first generates the server configuration using `gen_server_config.py`, then starts the server process. +The server writes logs under `${work_dir}/server/` and binds to the configured host:port (commonly 0.0.0.0:<port> unless overridden).If this isn’t accurate, replace with the correct defaults.
96-96
: Health-check behavior: mention retry policy/timeouts. Since this script gates benchmark execution, documenting retry interval and max wait time will prevent confusion on slow startups.
Example addition:
-This script orchestrates the execution of the benchmark client. It waits for the configuration files to be created and for the server's `/health` endpoint to respond, then it runs the benchmark.
+This script orchestrates the benchmark client. It waits for the configuration files and for the server's `/health` endpoint to respond (poll interval: <X>s, timeout: <Y>m), then runs the benchmark.
Please replace <X>/<Y> with actual values.
51-58
: Improve README with concrete outputs and usage example. The README should clearly document which files are emitted, their names, and how to invoke the script with the correct flag names. Here’s a suggested update:
### `gen_worker_config.py` -This Python script generates the worker configuration YAML file that configures the `trtllm-serve` workers. It creates separate configurations for context and generation workers with different tensor parallelism, batch sizes, and other parameters. +This Python script generates worker configuration YAML files for `trtllm-serve` workers, splitting settings for context (CTX) and generation (GEN) phases. When using the eplb load balancer, it also outputs a separate slots configuration. +**Outputs:** +- `ctx_config.yaml` — CTX worker configuration (written under `<WORK_DIR>/ctx_config.yaml`) +- `gen_config.yaml` — GEN worker configuration (written under `<WORK_DIR>/gen_config.yaml`) +- *(Optional)* `moe_load_balancer.yaml` — load-balancer slots config when `--eplb_num_slots > 0` **Usage:** The script is called from within `disaggr_torch.slurm`, but can be run directly: ```bash python examples/disaggregated/slurm/benchmark/gen_worker_config.py \ --work_dir <WORK_DIR> \ --ctx_tp_size <CTX_TP> \ --ctx_batch_size <CTX_BATCH_SIZE> \ --ctx_max_num_tokens <CTX_MAX_TOKENS> \ --ctx_max_seq_len <CTX_MAX_SEQ_LEN> \ [--ctx_free_gpu_memory_fraction <FRACTION>] \ [--ctx_enable_attention_dp] \ --gen_tp_size <GEN_TP> \ --gen_batch_size <GEN_BATCH_SIZE> \ --gen_max_num_tokens <GEN_MAX_TOKENS> \ --gen_max_seq_len <GEN_MAX_SEQ_LEN> \ [--gen_gpu_memory_fraction <FRACTION>] \ [--gen_enable_attention_dp] \ [--eplb_num_slots <NUM_SLOTS>] \ [--mtp_size <MTP_SIZE>] \ [--cache_transceiver_max_num_tokens <NUM_TOKENS>]On success, it prints:
ctx_config_file: <WORK_DIR>/ctx_config.yaml gen_config_file: <WORK_DIR>/gen_config.yaml generated successfully
- Correct flag names to use underscores (`--ctx_tp_size`, not `--ctx-tp`) - Call out all generated filenames, including `moe_load_balancer.yaml` - Include the script’s final print output so users know where to look for the files --- `59-66`: **Clarify host discovery logic and output filename in `gen_server_config.py` docs** Please update `examples/disaggregated/slurm/benchmark/README.md` (around lines 59–66) to explicitly describe how host lists are discovered and name the generated config file. For example: ```diff -### `gen_server_config.py` - -This Python script generates the server configuration YAML file that configures the `trtllm-serve` disaggregated server. It reads hostname information from the work directory and creates a configuration that specifies the URLs for context and generation servers. - -**Usage:** - -The script is called from within `start_server.sh`. It takes arguments for the number of context and generation servers and the work directory. +### `gen_server_config.py` + +This Python script generates the `server_config.yaml` for the `trtllm-serve disaggregated` server. +It discovers worker hosts by reading all files in the `hostnames` subdirectory of your work directory: + - Waits for `HOST_DIR/hostnames` to appear under `--work_dir` + - Expects exactly `<num_ctx_servers> + <num_gen_servers>` files + - Files prefixed `CTX` are treated as context servers; `GEN` as generation servers + - Each file must contain a single hostname (e.g., `ctx-node-01`) + +Once all host files are present, it builds a YAML config with: +```yaml +hostname: <current-machine-hostname> +port: <--server_port> +backend: pytorch +context_servers: + num_instances: <--num_ctx_servers> + urls: ["<CTX1>:<--worker_port>", ...] +generation_servers: + num_instances: <--num_gen_servers> + urls: ["<GEN1>:<--worker_port>", ...] +``` +and writes it to: + +``` +<WORK_DIR>/server_config.yaml +``` + +**Usage example:** +```bash +python examples/disaggregated/slurm/benchmark/gen_server_config.py \ + --work_dir "${WORK_DIR}" \ + --num_ctx_servers "${NUM_CTX}" \ + --num_gen_servers "${NUM_GEN}" \ + --worker_port 8336 \ + --server_port 8333 +``` + +This makes the host‐list requirements and output file explicit and reproducible outside SLURM.
📒 Files selected for processing (1)
- examples/disaggregated/slurm/benchmark/README.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/disaggregated/slurm/benchmark/README.md
[grammar] ~14-~14: There might be a mistake here.
Context: ...ver.sh: A shell script responsible for starting disaggregated server using
trtllm-serv...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...trtllm-serveon each allocated machine. 7.
run_benchmark.sh`: A shell script that waits for the serv...
(QB_NEW_EN)
🔇 Additional comments (3)
examples/disaggregated/slurm/benchmark/README.md (3)
7-7
: Good high-level framing. The added overview sentence helps set expectations for the new multi-script orchestration. No issues.
11-13
: Nice split between worker and server config generators. The doc now clearly separates worker and server responsibilities, aligning with recommended usage. This will help users customize CTX/GEN independently.
10-16
: Clarify `start_server.sh` invocation in README
We’ve confirmed that in disaggr_torch.slurm, start_server.sh is launched via `srun … -N 1 -n 1 bash start_server.sh …`, i.e. a single server task on one node—not on every allocated machine. Please update the README to reflect this.
• File: examples/disaggregated/slurm/benchmark/README.md
• Lines: 10–16, item 6
-6. `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine.
+6. `start_server.sh`: A shell script that starts the disaggregated server using `trtllm-serve` on a designated server node (launched with `srun -N 1 -n 1`), after generating the server configuration.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)
62-76
: Fix health-check loop: current condition never evaluates HTTP status; may report healthy prematurely. The loop negates curl’s exit code, not the HTTP status. With -w, curl prints status but still exits 0 for 200/404 alike unless --fail is used. Use --fail and test exit status, or capture and compare the code.
Minimal, robust fix:
-timeout=1800 -start_time=$(date +%s) -while ! curl -s -o /dev/null -w "%{http_code}" http://${hostname}:${port}/health; do +timeout=1800 +start_time=$(date +%s) +while ! curl -sfS -o /dev/null "http://${hostname}:${port}/health"; do current_time=$(date +%s) elapsed=$((current_time - start_time)) if [ $elapsed -ge $timeout ]; then echo "Error: Server is not healthy after ${timeout} seconds" exit 1 fi if [ $((elapsed % 30)) -eq 0 ]; then echo "Waiting for server to be healthy... (${elapsed}s elapsed)" fi sleep 10 done
♻️ Duplicate comments (7)
examples/disaggregated/slurm/benchmark/start_server.sh (1)
1-4
: Harden server launcher: add strict mode, usage check, quoting, and verify generated YAML before servingSame ask as earlier review: fail fast, validate args, and quote expansions. Also ensure the Python step succeeded by checking the YAML exists before starting the server.
Apply:
-#!/bin/bash -set -u -set -e -set -x +#!/usr/bin/env bash +set -euo pipefail +set -x +IFS=$'\n\t' -num_ctx_servers=$1 -num_gen_servers=$2 -work_dir=$3 -script_dir=$4 +if [ "$#" -lt 4 ]; then + echo "Usage: $0 <num_ctx_servers> <num_gen_servers> <work_dir> <script_dir>" >&2 + exit 2 +fi + +num_ctx_servers="$1" +num_gen_servers="$2" +work_dir="$3" +script_dir="$4" -python3 ${script_dir}/gen_server_config.py \ - --num_ctx_servers ${num_ctx_servers} \ - --num_gen_servers ${num_gen_servers} \ - --work_dir ${work_dir} -echo "server config generated to ${work_dir}/server_config.yaml" +python3 "${script_dir}/gen_server_config.py" \ + --num_ctx_servers "${num_ctx_servers}" \ + --num_gen_servers "${num_gen_servers}" \ + --work_dir "${work_dir}" +echo "server config generated to ${work_dir}/server_config.yaml" + +if [ ! -f "${work_dir}/server_config.yaml" ]; then + echo "Error: ${work_dir}/server_config.yaml not found; aborting." >&2 + exit 1 +fi -trtllm-serve disaggregated -c ${work_dir}/server_config.yaml -t 7200 -r 7200 +trtllm-serve disaggregated -c "${work_dir}/server_config.yaml" -t 7200 -r 7200Also applies to: 6-9, 11-17
examples/disaggregated/slurm/benchmark/start_worker.sh (3)
1-5
: Fix shebang and enable strict mode (pipefail + safe IFS)Reliable error propagation with pipelines and robust word-splitting protections.
-#! /bin/bash -set -u -set -e -set -x +#!/usr/bin/env bash +set -euo pipefail +set -x +IFS=$'\n\t'
31-40
: Resolve config_file first and verify it existsFail early if the derived YAML is missing.
if [ "${role}" = "CTX" ]; then config_file=${work_dir}/ctx_config.yaml elif [ "${role}" = "GEN" ]; then config_file=${work_dir}/gen_config.yaml else echo "Invalid role: ${role}" exit 1 fi echo "config_file: ${config_file}" +if [ ! -f "${config_file}" ]; then + echo "Config file not found: ${config_file}" >&2 + exit 1 +fi
16-19
: Critical: echoes unset config_file under set -u; script will abortReferencing ${config_file} before assignment with set -u causes immediate exit. Remove this echo or move after config resolution.
-unset UCX_TLS -echo "config_file: ${config_file}, concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}" -echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}" +unset UCX_TLS +echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}, concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}"examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3)
119-121
: Unify naming with start_worker.sh: prefer nsys_folder variablestart_worker.sh’s 9th arg is nsys_folder; using a different name here invites confusion. Standardize to nsys_folder and pass it through.
-nsys_on="" -# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling +nsys_folder="" +# nsys_folder=${full_logdir}/nsys # Uncomment to enable NSYS; GEN workers will write profiles here @@ - bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \ @@ - bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \Also applies to: 198-199, 217-218
203-220
: Mirror GEN fixes for CTX loopDrop --segment, be cautious with --ntasks-per-node, and quote nodelist.
- srun -l -N ${ctx_nodes_num} \ - --ntasks=${ctx_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --segment=${ctx_nodes_num} \ + srun -l -N "${ctx_nodes_num}" \ + --ntasks="${ctx_tp_size}" \ --container-image=${container_image} \ --container-name=${container_name} \ --container-mounts=${mounts} \ - --nodelist=$(IFS=,; echo "${node_list[*]}") \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ --mpi=pmix \ bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ &> ${full_logdir}/output_ctx_${i}.log &
169-178
: Guard against node under-allocation and bound CTX slice to required sizeEnsure your requested GEN+CTX nodes fit the allocation, and restrict ctx_nodes to exactly total_ctx_nodes_num (not the remainder).
# get the node list for the gen workers total_gen_nodes_num=$((gen_nodes_num * num_gen_servers)) gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}}) echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}" # get the node list for the ctx workers total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers)) -ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}}) +if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then + echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2 + exit 1 +fi +ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}}) echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
🧹 Nitpick comments (6)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (3)
80-97
: Make log harvesting resilient and shell-safe
- Use local vars to avoid global leakage.
- Enable nullglob to avoid iterating over literal globs when no files exist.
- Quote expansions.
-do_get_logs(){ - log_path=$1 - output_folder=$2 +do_get_logs(){ + local log_path="$1" + local output_folder="$2" + shopt -s nullglob @@ - for gen_file in ${log_path}/output_gen_*.log; do - if [ -f "$gen_file" ]; then - index=$(basename "$gen_file" | sed 's/output_gen_\(.*\)\.log/\1/') - grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "$gen_file" > "${output_folder}/gen_only_${index}.txt" || true - fi + for gen_file in "${log_path}"/output_gen_*.log; do + if [ -f "$gen_file" ]; then + index="$(basename "$gen_file" | sed 's/output_gen_\(.*\)\.log/\1/')" + grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "$gen_file" > "${output_folder}/gen_only_${index}.txt" || true + fi done @@ - for ctx_file in ${log_path}/output_ctx_*.log; do - if [ -f "$ctx_file" ]; then - index=$(basename "$ctx_file" | sed 's/output_ctx_\(.*\)\.log/\1/') - grep -a "'num_generation_tokens': 0" "$ctx_file" > "${output_folder}/ctx_only_${index}.txt" || true - fi + for ctx_file in "${log_path}"/output_ctx_*.log; do + if [ -f "$ctx_file" ]; then + index="$(basename "$ctx_file" | sed 's/output_ctx_\(.*\)\.log/\1/')" + grep -a "'num_generation_tokens': 0" "$ctx_file" > "${output_folder}/ctx_only_${index}.txt" || true + fi done } @@ - do_get_logs ${log_path} ${log_path}/concurrency_${concurrency} + do_get_logs "${log_path}" "${log_path}/concurrency_${concurrency}"Also applies to: 121-123
1-6
: Consider pipefail and safer trap for better diagnosticsWith pipelines and tee, set -o pipefail ensures failures propagate. The trap is good—retain it.
-set -e +set -e +set -o pipefail
55-61
: Minor hardening: quote URLs/paths and prefer graceful shutdown before SIGKILL
- Quote wget destination and URL.
- When shutting down, try TERM first, then KILL after a grace period.
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -O ${shared_gpt_path} +wget "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json" -O "${shared_gpt_path}" @@ -echo "Starting benchmark..." +echo "Starting benchmark..." @@ - mkdir -p ${log_path}/concurrency_${concurrency} + mkdir -p "${log_path}/concurrency_${concurrency}" @@ -echo "Benchmark done, gracefully shutting down server and workers..." -kill -9 $(ps aux | grep '[s]tart_server.sh' | awk '{print $2}') >/dev/null 2>&1 || true -kill -9 $(ps aux | grep '[s]tart_worker.sh' | awk '{print $2}') >/dev/null 2>&1 || true -kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true -sleep 20 # Give processes some time to clean up +echo "Benchmark done, gracefully shutting down server and workers..." +pkill -f '[s]tart_server.sh' || true +pkill -f '[s]tart_worker.sh' || true +pkill -f '[t]rtllm-serve' || true +sleep 10 +# Force kill stragglers +pkill -9 -f '[s]tart_server.sh' || true +pkill -9 -f '[s]tart_worker.sh' || true +pkill -9 -f '[t]rtllm-serve' || true +sleep 5Also applies to: 100-121, 125-136
examples/disaggregated/slurm/benchmark/start_worker.sh (1)
43-48
: Quote hostname redirection and mkdir pathPrevents word splitting and globs.
-if [ "${SLURM_NODEID}" = "0" ]; then - mkdir -p ${work_dir}/hostnames/ - echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt - echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt" +if [ "${SLURM_NODEID}" = "0" ]; then + mkdir -p "${work_dir}/hostnames/" + echo "$(hostname)" > "${work_dir}/hostnames/${role}_${instance_id}.txt" + echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt" fiexamples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)
87-106
: Log path label typo: dep vs tp/tep inconsistenciesIn full_logdir formatting, one branch uses dep${gen_tp_size}, the other uses tep${gen_tp_size}. Likely a typo; choose a single label (e.g., tp) for consistency and discoverability.
-full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_dep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size} +full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tp${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size} @@ - full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size} + full_logdir=${logdir}/ctx${num_ctx_servers}_gen${num_gen_servers}_tp${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
181-184
: Quote expansions and paths across srun blocksGeneral shell safety: quote nodelist construction, directories, and script paths.
-rm -rf ${full_logdir}/hostnames +rm -rf "${full_logdir}/hostnames" @@ - echo "gen node_list index ${i}: ${node_list[@]}" + echo "gen node_list index ${i}: ${node_list[@]}" @@ - --container-mounts=${mounts} \ + --container-mounts="${mounts}" \ @@ - echo "ctx node_list index ${i}: ${node_list[@]}" + echo "ctx node_list index ${i}: ${node_list[@]}" @@ -echo "pid_list: ${pid_list}" +echo "pid_list: ${pid_list}" @@ - --container-image=${container_image} \ - --container-mounts=${mounts} \ + --container-image="${container_image}" \ + --container-mounts="${mounts}" \ @@ - bash ${workdir}/start_server.sh ${num_ctx_servers} ${num_gen_servers} ${full_logdir} ${workdir} \ + bash "${workdir}/start_server.sh" "${num_ctx_servers}" "${num_gen_servers}" "${full_logdir}" "${workdir}" \ @@ - --container-mounts=${mounts} \ + --container-mounts="${mounts}" \ @@ - bash ${workdir}/run_benchmark.sh ${isl} ${osl} ${multi_round} ${model_dir} "${concurrency}" ${streaming} ${full_logdir} \ + bash "${workdir}/run_benchmark.sh" "${isl}" "${osl}" "${multi_round}" "${model_dir}" "${concurrency}" "${streaming}" "${full_logdir}" \ @@ - kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true + kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || trueAlso applies to: 189-197, 208-216, 222-231, 236-237, 239-244
📒 Files selected for processing (4)
- examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
- examples/disaggregated/slurm/benchmark/run_benchmark.sh (4 hunks)
- examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
- examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh
[warning] 46-46: Quote this to prevent word splitting.
(SC2046)
[warning] 53-53: Quote this to prevent word splitting.
(SC2046)
[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.
(SC2089)
[warning] 66-66: Quotes/backslashes in this variable will not be respected.
(SC2090)
[warning] 68-68: Quote this to prevent word splitting.
(SC2046)
🔇 Additional comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)
28-48
: Quoteconfig_file
and anchor YAML key extraction inrun_benchmark.sh
The
server_config.yaml
produced bygen_server_config.py
has a flat, top-levelhostname:
andport:
key (followed by nested sections forcontext_servers
andgeneration_servers
that do not definehostname:
orport:
keys). We can safely extract those two values by:
- Quoting
${config_file}
everywhere to guard against spaces in$log_path
.- Using an anchored
awk
pattern to match only top-level keys and stop after the first hit.Apply the following diff to
examples/disaggregated/slurm/benchmark/run_benchmark.sh
(lines 28–54):-config_file=${log_path}/server_config.yaml +config_file="${log_path}/server_config.yaml" # check if the config file exists every 10 seconds timeout 1800 seconds timeout=1800 start_time=$(date +%s) -while [ ! -f ${config_file} ]; do +while [ ! -f "${config_file}" ]; do current_time=$(date +%s) elapsed=$((current_time - start_time)) @@ -# grep the host and port from the config file -hostname=$(grep -i "hostname:" ${config_file} | awk '{print $2}') -port=$(grep -i "port:" ${config_file} | awk '{print $2}') +# extract only top-level hostname and port, then exit after the first match +hostname=$(awk '/^[[:space:]]*hostname:[[:space:]]*/ {print $2; exit}' "${config_file}") +port=$(awk '/^[[:space:]]*port:[[:space:]]*/ {print $2; exit}' "${config_file}")With this change:
- Spaces in
$log_path
won’t break the script.- Only the first occurrence of each key at the start of a line (allowing leading whitespace) is captured, avoiding any accidental matches elsewhere.
No further changes are needed—anchored
awk
suffices given the current YAML shape.
Actionable comments posted: 0
♻️ Duplicate comments (8)
examples/disaggregated/slurm/benchmark/start_worker.sh (4)
1-4
: Fix shebang and enable full strict mode (pipefail + safe IFS).Use a portable shebang and fail fast on pipeline errors; keeps debug tracing. This was raised earlier and still applies.
-#! /bin/bash -set -u -set -e -set -x +#!/usr/bin/env bash +set -euo pipefail +set -x +IFS=$'\n\t'
6-14
: Add argument-count validation with a clear usage message before reading $1..$9.With set -u, referencing missing positional args will exit with “unbound variable.” Guard early with a concise usage.
+if [ "$#" -lt 9 ]; then + echo "Usage: $0 <role:CTX|GEN> <instance_id> <model_path> <port> <benchmark_mode:e2e|gen_only> <concurrency> <enable_pdl:true|false> <work_dir> <nsys_folder|empty>" >&2 + exit 2 +fi + role=$1 instance_id=$2 model_path=$3 port=$4 benchmark_mode=$5 concurrency=$6 enable_pdl=$7 work_dir=$8 nsys_folder=${9:-}
31-40
: Validate config_file existence immediately after deriving it.Failing early prevents a confusing downstream launch error if the YAML wasn’t generated.
if [ "${role}" = "CTX" ]; then config_file=${work_dir}/ctx_config.yaml elif [ "${role}" = "GEN" ]; then config_file=${work_dir}/gen_config.yaml else echo "Invalid role: ${role}" exit 1 fi -echo "config_file: ${config_file}" +echo "config_file: ${config_file}" +if [ ! -f "${config_file}" ]; then + echo "Config file not found: ${config_file}" >&2 + exit 1 +fi
51-69
: Build NSYS prefix as an array; quote all args; fix SC2089/SC2090/SC2046.String-building a command breaks quoting; use an array and expand with "${arr[@]}". Also quote host/port/model/config paths. This was flagged previously and remains.
-if [ -z "${nsys_folder:-}" ]; then - echo "nsys is not enabled, start normal flow" - trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file} +if [ -z "${nsys_folder:-}" ]; then + echo "nsys is not enabled, start normal flow" + trtllm-llmapi-launch trtllm-serve "${model_path}" --host "$(hostname)" --port "${port}" --extra_llm_api_options "${config_file}" else - nsys_prefix="" - nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID} + nsys_args=() + nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}" export TLLM_PROFILE_RECORD_GC=1 export TLLM_NVTX_DEBUG=1 if [ "${role}" = "GEN" ]; then export TLLM_PROFILE_START_STOP=200-250 - nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none" - echo "nsys_prefix: ${nsys_prefix}" + nsys_args=(nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none) + echo "nsys_args: ${nsys_args[*]}" elif [ "${role}" = "CTX" ]; then echo "nsys is not enabled on ctx_gpus" fi - trtllm-llmapi-launch ${nsys_prefix} \ - trtllm-serve ${model_path} \ - --host $(hostname) --port ${port} \ - --extra_llm_api_options ${config_file} + trtllm-llmapi-launch "${nsys_args[@]}" \ + trtllm-serve "${model_path}" \ + --host "$(hostname)" --port "${port}" \ + --extra_llm_api_options "${config_file}" fiexamples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
119-121
: Standardize on nsys_folder to match worker arg name.Avoids variable-name drift between orchestrator and worker scripts. This was noted previously.
-nsys_on="" -# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling +nsys_folder="" +# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; GEN workers will write profiles here - bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \ - bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \Also applies to: 198-199, 217-218
162-178
: Bound ctx_nodes slice and add capacity check against allocated nodes.Prevents out-of-range slices and misassignment when allocation is insufficient. Previously suggested; still needed.
all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort)) total_nodes_num=${#all_nodes[@]} echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}" # get the node list for the gen workers total_gen_nodes_num=$((gen_nodes_num * num_gen_servers)) gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}}) echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}" # get the node list for the ctx workers total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers)) -ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}}) +if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then + echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2 + exit 1 +fi +ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}}) echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
189-197
: srun: remove invalid --segment and avoid overspecifying tasks-per-node; quote nodelist.--segment is not an srun option and will fail. Forcing --ntasks-per-node can overconstrain placement. Quote the nodelist so it’s one arg. This mirrors prior feedback.
- srun -l -N ${gen_nodes_num} \ - --ntasks=${gen_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --segment=${gen_nodes_num} \ + srun -l -N "${gen_nodes_num}" \ + --ntasks="${gen_tp_size}" \ --container-image=${container_image} \ --container-name=${container_name} \ --container-mounts=${mounts} \ - --nodelist=$(IFS=,; echo "${node_list[*]}") \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ --mpi=pmix \ bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \ &> ${full_logdir}/output_gen_${i}.log &Optionally, only set per-node tasks when divisible (place before the loop):
# compute optional arg if (( gen_tp_size % gen_nodes_num == 0 )); then gen_tasks_per_node=$(( gen_tp_size / gen_nodes_num )); gen_ntasks_per_node_arg=(--ntasks-per-node="${gen_tasks_per_node}"); else gen_ntasks_per_node_arg=(); fi # ... and add: "${gen_ntasks_per_node_arg[@]}"
208-216
: Mirror fix for CTX loop: remove --segment, be cautious with --ntasks-per-node, quote nodelist.Same runtime failure risk as GEN loop.
- srun -l -N ${ctx_nodes_num} \ - --ntasks=${ctx_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --segment=${ctx_nodes_num} \ + srun -l -N "${ctx_nodes_num}" \ + --ntasks="${ctx_tp_size}" \ --container-image=${container_image} \ --container-name=${container_name} \ --container-mounts=${mounts} \ - --nodelist=$(IFS=,; echo "${node_list[*]}") \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ --mpi=pmix \ bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \ &> ${full_logdir}/output_ctx_${i}.log &Optional divisible-only arg (before the loop):
if (( ctx_tp_size % ctx_nodes_num == 0 )); then ctx_tasks_per_node=$(( ctx_tp_size / ctx_nodes_num )); ctx_ntasks_per_node_arg=(--ntasks-per-node="${ctx_tasks_per_node}"); else ctx_ntasks_per_node_arg=(); fi # add: "${ctx_ntasks_per_node_arg[@]}"
🧹 Nitpick comments (5)
examples/disaggregated/slurm/benchmark/start_worker.sh (1)
17-19
: Quote all expansions and prefer printf for robust logging/IO.Addresses Shellcheck SC2046 and avoids surprises if paths or hostnames contain special chars.
-echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}" -echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}" +echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}" +echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}" -if [ "${SLURM_NODEID}" = "0" ]; then - mkdir -p ${work_dir}/hostnames/ - echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt - echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt" +if [ "${SLURM_NODEID}" = "0" ]; then + mkdir -p "${work_dir}/hostnames/" + printf '%s\n' "$(hostname)" > "${work_dir}/hostnames/${role}_${instance_id}.txt" + echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt" fiAlso applies to: 45-48
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
10-13
: Enable full strict mode (pipefail + safe IFS).Improves failure detection in multi-stage pipelines; keeps xtrace for debugging.
-set -u -set -e -set -x +set -euo pipefail +set -x +IFS=$'\n\t'
114-117
: Quote trtllm_repo and optionally gate the git call.Avoids edge cases when trtllm_repo is empty or contains spaces; keeps stderr suppressed as you already do.
-if [ -z "${TRT_LLM_GIT_COMMIT:-}" ]; then - export TRT_LLM_GIT_COMMIT=$(git -C ${trtllm_repo} rev-parse --short HEAD 2>/dev/null || echo "unknown") +if [ -z "${TRT_LLM_GIT_COMMIT:-}" ]; then + export TRT_LLM_GIT_COMMIT="$(git -C "${trtllm_repo}" rev-parse --short HEAD 2>/dev/null || echo "unknown")" echo "TRT_LLM_GIT_COMMIT: ${TRT_LLM_GIT_COMMIT}" fi
136-158
: Quote Python CLI args when generating YAML.Safer if any values contain special chars; consistent with other quoted usages.
-srun -l -N 1 -n 1 \ +srun -l -N 1 -n 1 \ --container-name=${container_name} \ --container-mounts=${mounts} \ --mpi=pmix --overlap \ - python3 ${workdir}/gen_worker_config.py \ - --work_dir ${full_logdir} \ - --ctx_tp_size ${ctx_tp_size} \ - --ctx_batch_size ${ctx_batch_size} \ - --ctx_max_num_tokens ${ctx_max_num_tokens} \ - --ctx_max_seq_len ${ctx_max_seq_len} \ - --ctx_free_gpu_memory_fraction ${ctx_gpu_frac} \ - --gen_tp_size ${gen_tp_size} \ - --gen_batch_size ${gen_batch_size} \ - --gen_max_num_tokens ${gen_max_num_tokens} \ - --gen_max_seq_len ${gen_max_seq_len} \ - --gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \ - --eplb_num_slots ${eplb_num_slots} \ - --mtp_size ${mtp_size} \ - --cache_transceiver_max_num_tokens ${cache_transceiver_max_num_tokens} \ + python3 "${workdir}/gen_worker_config.py" \ + --work_dir "${full_logdir}" \ + --ctx_tp_size "${ctx_tp_size}" \ + --ctx_batch_size "${ctx_batch_size}" \ + --ctx_max_num_tokens "${ctx_max_num_tokens}" \ + --ctx_max_seq_len "${ctx_max_seq_len}" \ + --ctx_free_gpu_memory_fraction "${ctx_gpu_frac}" \ + --gen_tp_size "${gen_tp_size}" \ + --gen_batch_size "${gen_batch_size}" \ + --gen_max_num_tokens "${gen_max_num_tokens}" \ + --gen_max_seq_len "${gen_max_seq_len}" \ + --gen_gpu_memory_fraction "${gen_gpu_memory_fraction}" \ + --eplb_num_slots "${eplb_num_slots}" \ + --mtp_size "${mtp_size}" \ + --cache_transceiver_max_num_tokens "${cache_transceiver_max_num_tokens}" \ $(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \ $(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \ 2>&1 | tee ${full_logdir}/gen_worker_config.log
232-237
: Quote all benchmark script args for consistency.Minor robustness nit; you already quote concurrency.
- bash ${workdir}/run_benchmark.sh ${isl} ${osl} ${multi_round} ${model_dir} "${concurrency}" ${streaming} ${full_logdir} \ + bash "${workdir}/run_benchmark.sh" "${isl}" "${osl}" "${multi_round}" "${model_dir}" "${concurrency}" "${streaming}" "${full_logdir}" \ &> ${full_logdir}/benchmark.log 2>&1
📒 Files selected for processing (2)
- examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
- examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh
[warning] 46-46: Quote this to prevent word splitting.
(SC2046)
[warning] 53-53: Quote this to prevent word splitting.
(SC2046)
[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.
(SC2089)
[warning] 66-66: Quotes/backslashes in this variable will not be respected.
(SC2090)
[warning] 68-68: Quote this to prevent word splitting.
(SC2046)
Actionable comments posted: 1
♻️ Duplicate comments (12)
examples/disaggregated/slurm/benchmark/gen_server_config.py (3)
1-7
: Add NVIDIA 2025 header and switch to structured loggingThe file is missing the required NVIDIA copyright header and uses print statements. Prefer the logging module for controllable verbosity and cleaner SLURM logs.
Apply:
+#!/usr/bin/env python3 +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import argparse import os import socket import time +import logging +from typing import List import yaml
8-30
: Add wait timeout and log-level flags; initialize logging earlyBlocking indefinitely is brittle in SLURM runs. Expose a timeout and configurable log level, then initialize logging before any waits.
if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--num_ctx_servers", type=int, required=True, help="Number of context servers") parser.add_argument("--num_gen_servers", type=int, required=True, help="Number of generation servers") parser.add_argument("--work_dir", type=str, default="logs", help="Work directory") parser.add_argument("--worker_port", type=int, default=8336, help="Worker port") parser.add_argument("--server_port", type=int, default=8333, help="Server port") + parser.add_argument("--wait_timeout_sec", + type=int, + default=600, + help="Max seconds to wait for hostname files before failing") + parser.add_argument("--log_level", + type=str, + default="INFO", + choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], + help="Logging level") args = parser.parse_args() + + logging.basicConfig( + level=getattr(logging, args.log_level.upper(), logging.INFO), + format="%(asctime)s %(levelname)s %(message)s", + ) + logger = logging.getLogger(__name__)
36-49
: Avoid infinite wait; check CTX/GEN counts separately; dedup and validate; fix E501 by loggingThe current polling can block forever and relies on a combined count. Filter CTX/GEN files, enforce a deadline, deduplicate hostnames, and replace long prints with concise logs.
- #check all of the hostnames in the hostnames folder exists, if not, sleep 10 seconds and check again - hostnames_folder = os.path.join(args.work_dir, "hostnames") - while not os.path.exists(hostnames_folder): - time.sleep(10) - print(f"Waiting for hostnames folder {hostnames_folder} to be found") - hostnames = os.listdir(hostnames_folder) - # check length of hostnames is equal to num_ctx_servers + num_gen_servers, if not, sleep 10 seconds and check again - while len(hostnames) != args.num_ctx_servers + args.num_gen_servers: - time.sleep(10) - hostnames = os.listdir(hostnames_folder) - print( - f"Waiting for hostnames to be found in {hostnames_folder}, current length: {len(hostnames)}, expected length: {args.num_ctx_servers + args.num_gen_servers}" - ) - print(f"All hostnames found in {hostnames_folder}") + # Wait for hostnames/ to appear and contain the expected CTX/GEN files + hostnames_folder = os.path.join(args.work_dir, "hostnames") + deadline = time.monotonic() + args.wait_timeout_sec + while True: + if os.path.exists(hostnames_folder): + all_files = os.listdir(hostnames_folder) + ctx_files = sorted(f for f in all_files if f.startswith("CTX")) + gen_files = sorted(f for f in all_files if f.startswith("GEN")) + if (len(ctx_files) == args.num_ctx_servers and + len(gen_files) == args.num_gen_servers): + break + logger.info("Waiting for hostnames in %s: CTX=%d/%d, GEN=%d/%d", + hostnames_folder, len(ctx_files), args.num_ctx_servers, + len(gen_files), args.num_gen_servers) + else: + logger.info("Waiting for hostnames folder %s to be created ...", + hostnames_folder) + if time.monotonic() >= deadline: + raise TimeoutError( + f"Timed out after {args.wait_timeout_sec}s waiting for " + f"hostname files in {hostnames_folder}" + ) + time.sleep(10) + logger.info("All required hostname files found in %s", hostnames_folder) - # get the ctx and gen hostnames from the hostnames file - ctx_hostnames = [] - gen_hostnames = [] - for hostname_file in hostnames: - hostname_file_path = os.path.join(hostnames_folder, hostname_file) - with open(hostname_file_path, 'r') as f: - actual_hostname = f.read().strip() - print(f"Hostname: {actual_hostname} in {hostname_file}") - - if hostname_file.startswith("CTX"): - ctx_hostnames.append(actual_hostname) - elif hostname_file.startswith("GEN"): - gen_hostnames.append(actual_hostname) + # Read actual hostnames from CTX/GEN files + ctx_hostnames: List[str] = [] + for hostname_file in ctx_files: + hostname_file_path = os.path.join(hostnames_folder, hostname_file) + with open(hostname_file_path, "r", encoding="utf-8") as f: + actual_hostname = f.read().strip() + logger.debug("CTX host: %s (%s)", actual_hostname, hostname_file) + ctx_hostnames.append(actual_hostname) + + gen_hostnames: List[str] = [] + for hostname_file in gen_files: + hostname_file_path = os.path.join(hostnames_folder, hostname_file) + with open(hostname_file_path, "r", encoding="utf-8") as f: + actual_hostname = f.read().strip() + logger.debug("GEN host: %s (%s)", actual_hostname, hostname_file) + gen_hostnames.append(actual_hostname) + + # Deduplicate while preserving order + ctx_hostnames = list(dict.fromkeys(ctx_hostnames)) + gen_hostnames = list(dict.fromkeys(gen_hostnames)) + + if (len(ctx_hostnames) != args.num_ctx_servers or + len(gen_hostnames) != args.num_gen_servers): + raise RuntimeError( + "Mismatch between expected and discovered hosts: " + f"expected CTX={args.num_ctx_servers}, GEN={args.num_gen_servers}; " + f"found CTX={len(ctx_hostnames)}, GEN={len(gen_hostnames)}" + ) - 
print(f"ctx_hostnames: {ctx_hostnames}") - print(f"gen_hostnames: {gen_hostnames}") + logger.info("ctx_hostnames: %s", ctx_hostnames) + logger.info("gen_hostnames: %s", gen_hostnames)Also applies to: 51-66
examples/disaggregated/slurm/benchmark/gen_worker_config.py (4)
1-4
: Add NVIDIA 2025 headerMandatory per repo guidelines.
+#!/usr/bin/env python3 +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + import argparse import os
21-22
: Align default cache_transceiver_max_num_tokens with CLI (8448)Signature default 4608 disagrees with the CLI default 8448; this can surprise direct callers.
- cache_transceiver_max_num_tokens: int = 4608) -> None: + cache_transceiver_max_num_tokens: int = 8448) -> None:
23-45
: Replace stale/inaccurate docstring with a correct Google-style docstringThe current docstring lists non-existent args and omits real ones.
- """ - Generate configuration YAML file for disaggregated inference. - - Args: - config_path: Path to save the config file - model_path: Path to the model - num_ctx_servers: Number of context servers - ctx_tp_size: Tensor parallel size for context servers - ctx_batch_size: Batch size for context servers - ctx_max_num_tokens: Max number of tokens for context servers - ctx_max_seq_len: Max sequence length for context servers - ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers - ctx_enable_attention_dp: Enable attention DP for context servers - num_gen_servers: Number of generation servers - gen_tp_size: Tensor parallel size for generation servers - gen_batch_size: Batch size for generation servers - gen_max_num_tokens: Max number of tokens for generation servers - gen_enable_attention_dp: Enable attention DP for generation servers - gen_gpu_memory_fraction: GPU memory fraction for generation servers - eplb_num_slots: Number of slots for eplb - worker_start_port: Start port for workers - server_port: Server port - """ + """ + Generate per-role YAML configs for disaggregated workers. + + Writes two files into work_dir: + - ctx_config.yaml: context workers + - gen_config.yaml: generation workers + + Args: + work_dir: Directory to write the YAML files. + ctx_tp_size: Tensor-parallel size for context workers. + ctx_batch_size: Max batch size for context workers. + ctx_max_num_tokens: Max tokens per batch for context workers. + ctx_max_seq_len: Max sequence length for context workers. + ctx_free_gpu_memory_fraction: Fraction of GPU memory kept free on ctx. + ctx_enable_attention_dp: Enable attention data parallel on ctx. + gen_tp_size: Tensor-parallel size for generation workers. + gen_batch_size: Max batch size for generation workers. + gen_max_num_tokens: Max tokens per batch for generation workers. + gen_max_seq_len: Max sequence length for generation workers. + gen_enable_attention_dp: Enable attention data parallel on gen. + gen_gpu_memory_fraction: Fraction of GPU memory kept free on gen. + eplb_num_slots: MoE expert-parallel load balancer slots (0 disables). + mtp_size: If > 0, enable MTP speculative decoding with this many layers. + cache_transceiver_max_num_tokens: Max tokens buffered by transceiver. + + Returns: + None + """
132-137
: Create work_dir before writing; avoid shadowing function nameWriting YAMLs fails if work_dir is missing. Also, local variable gen_config_file shadows the function name; rename for clarity.
- ctx_config_file = os.path.join(work_dir, "ctx_config.yaml") - gen_config_file = os.path.join(work_dir, "gen_config.yaml") - with open(ctx_config_file, "w") as f: + os.makedirs(work_dir, exist_ok=True) + ctx_config_file = os.path.join(work_dir, "ctx_config.yaml") + gen_config_path = os.path.join(work_dir, "gen_config.yaml") + with open(ctx_config_file, "w", encoding="utf-8") as f: yaml.dump(ctx_config, f, default_flow_style=False, sort_keys=False) - with open(gen_config_file, "w") as f: + with open(gen_config_path, "w", encoding="utf-8") as f: yaml.dump(gen_config, f, default_flow_style=False, sort_keys=False) - print( - f"ctx_config_file: {ctx_config_file} gen_config_file: {gen_config_file} generated successfully" - ) + print(f"ctx_config_file: {ctx_config_file} " + f"gen_config_file: {gen_config_path} generated successfully")Also applies to: 139-141
examples/disaggregated/slurm/benchmark/README.md (1)
114-117
: Startup order: start the server before workers (or document worker retry)Current steps launch workers before the server, but workers do not implement retry/wait for server readiness. Safer to start the server first.
-5. `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files. -6. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers for both context and generation phases. -7. `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration using `gen_server_config.py`. +5. `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files. +6. `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration via `gen_server_config.py`. +7. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes. Workers will wait/retry until the server is reachable. (If no retry is implemented, keep this step after server startup.)examples/disaggregated/slurm/benchmark/start_worker.sh (4)
1-5
: Fix shebang and enable strict mode with pipefail + safe IFSImprove portability and early failure behavior.
-#! /bin/bash -set -u -set -e -set -x +#!/usr/bin/env bash +set -euo pipefail +set -x +IFS=$'\n\t'
6-15
: Validate argument count and provide usagePrevents undefined behavior when fewer than 9 args are supplied.
role=$1 instance_id=$2 model_path=$3 port=$4 benchmark_mode=$5 concurrency=$6 enable_pdl=$7 work_dir=$8 nsys_folder=${9:-} +if [ "$#" -lt 9 ]; then + echo "Usage: $0 <role:CTX|GEN> <instance_id> <model_path> <port> <benchmark_mode:e2e|gen_only> <concurrency> <enable_pdl:true|false> <work_dir> <nsys_folder|empty>" >&2 + exit 2 +fi
31-40
: Check config existence after selectionFail fast with a clear message if the resolved YAML is missing.
if [ "${role}" = "CTX" ]; then config_file=${work_dir}/ctx_config.yaml elif [ "${role}" = "GEN" ]; then config_file=${work_dir}/gen_config.yaml else echo "Invalid role: ${role}" exit 1 fi -echo "config_file: ${config_file}" +echo "config_file: ${config_file}" +if [ ! -f "${config_file}" ]; then + echo "Config file not found: ${config_file}" >&2 + exit 1 +fi
51-69
: Use arrays for NSYS prefix; quote all args; fix SC2089/SC2090/SC2046Compose NSYS arguments as an array and quote all parameter expansions.
-if [ -z "${nsys_folder:-}" ]; then - echo "nsys is not enabled, start normal flow" - trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file} +if [ -z "${nsys_folder:-}" ]; then + echo "nsys is not enabled, start normal flow" + trtllm-llmapi-launch trtllm-serve "${model_path}" \ + --host "$(hostname)" --port "${port}" \ + --extra_llm_api_options "${config_file}" else - nsys_prefix="" - nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID} + nsys_args=() + nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}" export TLLM_PROFILE_RECORD_GC=1 export TLLM_NVTX_DEBUG=1 if [ "${role}" = "GEN" ]; then export TLLM_PROFILE_START_STOP=200-250 - nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none" - echo "nsys_prefix: ${nsys_prefix}" + nsys_args=(nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none) + echo "nsys_args: ${nsys_args[*]}" elif [ "${role}" = "CTX" ]; then echo "nsys is not enabled on ctx_gpus" fi - trtllm-llmapi-launch ${nsys_prefix} \ - trtllm-serve ${model_path} \ - --host $(hostname) --port ${port} \ - --extra_llm_api_options ${config_file} + trtllm-llmapi-launch "${nsys_args[@]}" \ + trtllm-serve "${model_path}" \ + --host "$(hostname)" --port "${port}" \ + --extra_llm_api_options "${config_file}" fi
🧹 Nitpick comments (7)
examples/disaggregated/slurm/benchmark/gen_server_config.py (3)
68-75
: Prefer logging for local hostname; minor nitUse logging instead of print for consistency.
- # get current hostname from env - hostname = socket.gethostname() - print(f"Current hostname: {hostname}") + # get current hostname + hostname = socket.gethostname() + logger.info("Current hostname: %s", hostname)
86-90
: Stabilize YAML output and final messageExplicit YAML options yield deterministic output; switch final print to logging.
- with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f: - yaml.dump(server_config, f) - print( - f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated" - ) + out_path = os.path.join(args.work_dir, "server_config.yaml") + with open(out_path, "w", encoding="utf-8") as f: + yaml.dump(server_config, f, default_flow_style=False, sort_keys=False) + logger.info("Server config generated: %s", out_path)
22-29
: Validate port ranges (0 < port < 65536)Defensive check to prevent misconfigurations that would cause bind/connect failures.
parser.add_argument("--worker_port", type=int, default=8336, help="Worker port") parser.add_argument("--server_port", type=int, default=8333, help="Server port") args = parser.parse_args() + + for p_name in ("worker_port", "server_port"): + p_val = getattr(args, p_name) + if not (0 < p_val < 65536): + raise ValueError(f"{p_name} must be in (0, 65535), got {p_val}")examples/disaggregated/slurm/benchmark/gen_worker_config.py (2)
67-69
: Ensure CUDA graph batch sizes are unique and sortedAvoid duplicates and keep the list monotonically increasing.
- gen_cuda_graph_batch_sizes = [ - 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size - ] + gen_cuda_graph_batch_sizes = sorted({ + 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768, 1024, 2048, gen_batch_size + })
1-6
: Optional: switch to logging and expose --log_levelFor parity with gen_server_config.py and easier debugging in SLURM.
import argparse import os +import logging import yaml @@ if __name__ == "__main__": parser = argparse.ArgumentParser() + parser.add_argument("--log_level", + type=str, + default="INFO", + choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], + help="Logging level") @@ args = parser.parse_args() + logging.basicConfig( + level=getattr(logging, args.log_level.upper(), logging.INFO), + format="%(asctime)s %(levelname)s %(message)s", + ) + logger = logging.getLogger(__name__) + gen_config_file(args.work_dir, args.ctx_tp_size, args.ctx_batch_size, args.ctx_max_num_tokens, args.ctx_max_seq_len, args.ctx_free_gpu_memory_fraction, args.ctx_enable_attention_dp, args.gen_tp_size, args.gen_batch_size, args.gen_max_num_tokens, args.gen_max_seq_len, args.gen_enable_attention_dp, args.gen_gpu_memory_fraction, args.eplb_num_slots, args.mtp_size, args.cache_transceiver_max_num_tokens)Also applies to: 145-151, 212-221
examples/disaggregated/slurm/benchmark/README.md (1)
14-14
: Grammar nit: add article“starting disaggregated server” → “starting the disaggregated server”.
-6. `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine. +6. `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.examples/disaggregated/slurm/benchmark/start_worker.sh (1)
16-19
: Quote variable expansions in echoes and commandsMinor, but prevents word splitting if values contain spaces.
-echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}" -echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}" +echo "concurrency: ${concurrency}, enable_pdl: ${enable_pdl}, work_dir: ${work_dir}" +echo "SLURM_PROCID: ${SLURM_PROCID}, hostname: $(hostname), instance_id: ${instance_id}"(Note: keep as-is if your style excludes quoting inside echo for simple status lines.)
Also applies to: 53-53, 68-68
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
examples/disaggregated/slurm/benchmark/README.md (3 hunks)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
examples/disaggregated/slurm/benchmark/gen_server_config.py (1 hunks)
examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
examples/disaggregated/slurm/benchmark/gen_yaml.py (0 hunks)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (4 hunks)
examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
💤 Files with no reviewable changes (1)
- examples/disaggregated/slurm/benchmark/gen_yaml.py
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/disaggregated/slurm/benchmark/start_server.sh
- examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
- examples/disaggregated/slurm/benchmark/run_benchmark.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/disaggregated/slurm/benchmark/gen_worker_config.py
examples/disaggregated/slurm/benchmark/gen_server_config.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/disaggregated/slurm/benchmark/gen_worker_config.py
examples/disaggregated/slurm/benchmark/gen_server_config.py
🧠 Learnings (1)
📚 Learning: 2025-08-12T10:28:57.320Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.320Z
Learning: Applies to **/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files
Applied to files:
examples/disaggregated/slurm/benchmark/gen_worker_config.py
🪛 LanguageTool
examples/disaggregated/slurm/benchmark/README.md
[grammar] ~14-~14: There might be a mistake here.
Context: ...ver.sh: A shell script responsible for starting disaggregated server using
trtllm-serv...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...trtllm-serveon each allocated machine. 7.
run_benchmark.sh`: A shell script that waits for the serv...
(QB_NEW_EN)
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh
[warning] 46-46: Quote this to prevent word splitting.
(SC2046)
[warning] 53-53: Quote this to prevent word splitting.
(SC2046)
[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.
(SC2089)
[warning] 66-66: Quotes/backslashes in this variable will not be respected.
(SC2090)
[warning] 68-68: Quote this to prevent word splitting.
(SC2046)
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/gen_server_config.py
47-47: Line too long (168 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Actionable comments posted: 1
♻️ Duplicate comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
119-121
: Use a consistent “nsys_folder” arg name and semantics across script and workers
Earlier reviews noted workers expect a folder path parameter (nsys_folder). This script still defines/passes nsys_on. Standardize to nsys_folder and pass it through.
Apply:
-nsys_on="" -# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling +nsys_folder="" +# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling; workers will write profiles here @@ - bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \ @@ - bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \Verification script to confirm start_worker.sh’s expected last-arg name:
#!/bin/bash # Inspect worker launcher args to confirm the profiling arg name/semantics fd -a start_worker.sh rg -nP 'add_argument|^\s*#.*nsys|nsys_folder|nsys_on' $(fd -a start_worker.sh)Also applies to: 196-197, 214-215
165-178
: Bound ctx_nodes slice and add capacity guardCurrent code assigns ctx_nodes to the remainder of all_nodes, not the exact required count, and doesn’t fail fast if requested nodes exceed allocation.
Apply:
all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort)) total_nodes_num=${#all_nodes[@]} echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}" # get the node list for the gen workers total_gen_nodes_num=$((gen_nodes_num * num_gen_servers)) gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}}) echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}" # get the node list for the ctx workers total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers)) -ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}}) +if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then + echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2 + exit 1 +fi +ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}}) echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
187-197
: Remove invalid srun --segment and avoid overspecifying --ntasks-per-node; quote nodelist
--segment
is not a valid Slurm option; this will fail immediately. Also, forcing--ntasks-per-node
can overconstrain placement when tasks aren’t divisible. Since you already pass an explicit nodelist and total --ntasks, omit it (or compute it only when divisible).Apply:
srun -l -N ${gen_nodes_num} \ --ntasks=${gen_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --segment=${gen_nodes_num} \ --container-image=${container_image} \ --container-name=${container_name} \ --container-mounts=${mounts} \ - --nodelist=$(IFS=,; echo "${node_list[*]}") \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ --mpi=pmix \ bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ &> ${full_logdir}/output_gen_${i}.log &If you must constrain per-node tasks, compute it only when divisible:
# before the loop if (( gen_tp_size % gen_nodes_num == 0 )); then gen_tasks_per_node=$(( gen_tp_size / gen_nodes_num )); gen_ntpn=(--ntasks-per-node="${gen_tasks_per_node}"); else gen_ntpn=(); fi # and add: "${gen_ntpn[@]}" after --ntasks
205-215
: Mirror the GEN fixes: drop --segment, be cautious with --ntasks-per-node; quote nodelistSame as above for CTX: remove invalid
--segment
, don’t overspecify--ntasks-per-node
unless divisible, and quote the nodelist.Apply:
srun -l -N ${ctx_nodes_num} \ --ntasks=${ctx_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --segment=${ctx_nodes_num} \ --container-image=${container_image} \ --container-name=${container_name} \ --container-mounts=${mounts} \ - --nodelist=$(IFS=,; echo "${node_list[*]}") \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ --mpi=pmix \ bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ &> ${full_logdir}/output_ctx_${i}.log &
🧹 Nitpick comments (3)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3)
49-51
: Derive tasks-per-node robustly from Slurm env (TASKS_PER_NODE may be the only one set)
SLURM_NTASKS_PER_NODE
is not always exported; clusters commonly setSLURM_TASKS_PER_NODE
(e.g., “4(x2)” or “2,2”). Parse it when available; fall back sanely.Apply:
-# Get GPUs per node dynamically from SLURM -ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4} # Default to 4 for GB200 +# Derive tasks per node from Slurm (prefer SLURM_TASKS_PER_NODE), else default. +if [[ -n "${SLURM_TASKS_PER_NODE:-}" ]]; then + # Example formats: "4(x2)", "2,2", "1(x3),2" -> pick the first number + ntasks_per_node="$(sed -E 's/^([0-9]+).*/\1/' <<< "${SLURM_TASKS_PER_NODE}")" +elif [[ -n "${SLURM_NTASKS_PER_NODE:-}" ]]; then + ntasks_per_node="${SLURM_NTASKS_PER_NODE}" +else + ntasks_per_node=4 # Default for GB200 +fi
10-13
: Ensure cleanup always runs: use a trap instead of a final scancelWith
set -e
, earlier failures skip the finalscancel
. Add a trap so the job gets cancelled on any error or signal; you can then drop the explicitscancel
at the end.Apply:
set -u set -e +set -o pipefail set -x +cleanup() { scancel "${SLURM_JOB_ID}" 2>/dev/null || true; } +trap cleanup EXIT INT TERM @@ -scancel ${SLURM_JOB_ID} +# cleanup handled by trapAlso applies to: 233-233
94-96
: Unused variables (ctx_gpus/gen_gpus)
ctx_gpus
andgen_gpus
are computed but never used. Drop them or print for observability.Apply:
-ctx_gpus=$((num_ctx_servers * ctx_tp_size)) -gen_gpus=$((num_gen_servers * gen_tp_size)) +# ctx_gpus=$((num_ctx_servers * ctx_tp_size)) +# gen_gpus=$((num_gen_servers * gen_tp_size))
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)
146-153
: No action needed: Flags correctly match parser definitionsThe Slurm script is using the exact flag names defined in
gen_worker_config.py
:
--ctx_free_gpu_memory_fraction
corresponds toparser.add_argument("--ctx_free_gpu_memory_fraction", …)
(line 167)--gen_gpu_memory_fraction
corresponds toparser.add_argument("--gen_gpu_memory_fraction", …)
(line 195)There is no mismatch between the script and the generator’s expected arguments.
Likely an incorrect or invalid review comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)
129-134
: Ensurepipefail
is enabled to surface errors in piped commandsThe check
PIPEFAIL=off
shows thatpipefail
is not active in the shell running this SLURM script, so any failures in thesrun … | tee
pipeline will be masked. We must explicitly enable it at the top of the script.• File: examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
• Add before any piped commands (e.g., before line 1):- #!/bin/bash + #!/bin/bash + set -o pipefailThis change ensures that if
pip install -e .
(or any other stage) fails, the pipeline aborts with a non-zero exit code and CI will catch the error.
Applies likewise to the block on lines 136–159.
♻️ Duplicate comments (3)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (3)
194-205
: Mirror the GEN fix for CTX loop: don’t force --ntasks-per-node unless safeSame rationale as the GEN loop: either omit or set conditionally when evenly divisible.
+ctx_ntasks_per_node_arg=() +if (( ctx_tp_size % ctx_nodes_num == 0 )); then + ctx_ntasks_per_node_arg=(--ntasks-per-node=$(( ctx_tp_size / ctx_nodes_num ))) +fi srun -l -N ${ctx_nodes_num} \ - --ntasks=${ctx_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ + --ntasks=${ctx_tp_size} "${ctx_ntasks_per_node_arg[@]}" \ --container-image=${container_image} \
10-13
: Enable pipefail so failures in piped commands (e.g., srun | tee) fail the scriptLater you pipe to tee; without pipefail, upstream failures are ignored and the script continues, producing misleading “success.” Add pipefail here.
set -u set -e +set -o pipefail set -x
165-168
: Slice the ctx_nodes to the exact requested count and add a capacity checkRight now ctx_nodes uses the remainder of all_nodes, which can overshoot the requested count and mask allocation mistakes. Add a guard and bound the slice.
all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort)) total_nodes_num=${#all_nodes[@]} echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}" # get the node list for the gen workers total_gen_nodes_num=$((gen_nodes_num * num_gen_servers)) gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}}) echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}" # get the node list for the ctx workers total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers)) -ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}}) +if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then + echo "Error: Requested nodes exceed allocation: gen=${total_gen_nodes_num} ctx=${total_ctx_nodes_num} total=${total_nodes_num}" >&2 + exit 1 +fi +ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}}) echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"Also applies to: 169-173, 174-178
🧹 Nitpick comments (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6)
49-51
: Prefer SLURM_GPUS_ON_NODE as fallback when deriving ntasks_per_nodeOn GPU jobs, GPUS_ON_NODE is often the better signal than NTASKS_PER_NODE. Keep your current default but prefer GPUs when available.
-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4} # Default to 4 for GB200 +ntasks_per_node="${SLURM_NTASKS_PER_NODE:-${SLURM_GPUS_ON_NODE:-4}}" # Prefer GPUs-per-node when present; default 4 for GB200
162-164
: Derivation of per-group node counts looks right; consider guarding for zero or negative valuesIf tp_size is 0 (misconfiguration), these could compute to 0 and later srun -N 0 will fail obscurely. A simple validation improves UX.
ctx_nodes_num=$(((ctx_tp_size + ntasks_per_node - 1) / ntasks_per_node)) gen_nodes_num=$(((gen_tp_size + ntasks_per_node - 1) / ntasks_per_node)) +if (( ctx_nodes_num <= 0 || gen_nodes_num <= 0 )); then + echo "Error: computed nodes per group (ctx=${ctx_nodes_num}, gen=${gen_nodes_num}) must be > 0. Check *_tp_size and ntasks_per_node." >&2 + exit 1 +fi
179-180
: Nit: leftover cleanup path is now unused elsewhereIf hostnames artifacts are no longer produced, consider removing this cleanup to avoid confusion.
-rm -rf ${full_logdir}/hostnames +# Legacy cleanup (hostnames) removed; no longer produced
181-193
: Avoid overspecifying --ntasks-per-node; let Slurm place tasks unless evenly divisibleYou already compute -N and total --ntasks. For non-divisible cases, forcing --ntasks-per-node can overconstrain and fail placement. Either drop it or add it conditionally when divisible.
Option A (simpler): drop it
srun -l -N ${gen_nodes_num} \ --ntasks=${gen_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ --container-image=${container_image} \
Option B (conditional):
+gen_ntasks_per_node_arg=() +if (( gen_tp_size % gen_nodes_num == 0 )); then + gen_ntasks_per_node_arg=(--ntasks-per-node=$(( gen_tp_size / gen_nodes_num ))) +fi srun -l -N ${gen_nodes_num} \ - --ntasks=${gen_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ + --ntasks=${gen_tp_size} "${gen_ntasks_per_node_arg[@]}" \ --container-image=${container_image} \
222-222
: Be cautious with scancel inside the job; ensure all logs are flushedCalling scancel on the current job is fine here but can truncate trailing logs. Consider a short wait or ensuring backgrounded sruns have terminated before scancel.
- scancel ${SLURM_JOB_ID} +# Give backgrounded workers a moment to flush logs before cancellation (optional) +sleep 2 +scancel ${SLURM_JOB_ID}
119-121
: Standardize Nsys profiling directory usageChange the profiling flag to a directory path (
nsys_folder
) and create it when set. This aligns with the 9th positional argument instart_worker.sh
(verified:$9
→nsys_folder
), and clarifies intent versus a boolean flag.Files/locations to update:
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
, lines 119–121- Same file, around the two
start_worker.sh
invocations at lines 181–193 and 194–205Suggested diff:
-nsys_on="" -# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling +nsys_folder="" +# nsys_folder="${full_logdir}/nsys" # Uncomment to enable Nsys profiling and store traces under this directory +if [ -n "${nsys_folder}" ]; then + mkdir -p "${nsys_folder}" +fi @@ (around start_worker.sh invocations) - bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ - bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ + bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \ + bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_folder} \
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
87-88
: LGTM: container_name/logdir reorg improves clarity and log scopingConsistent container identity and per-job log roots are helpful for debugging. No action needed.
114-117
: LGTM: safe export of TRT_LLM_GIT_COMMIT with fallbackThe short hash export with graceful fallback to “unknown” is robust. No action needed.
209-213
: LGTM: server startup separated and parameterized via start_server.shClear separation and reuse. No action needed.
217-220
: LGTM: benchmark step parameterization and logging target are clearStraightforward pass-through to run_benchmark.sh with logs centralized. No action needed.
Actionable comments posted: 0
♻️ Duplicate comments (6)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6)
10-12
: Add pipefail and make tracing optional to avoid masking failures and noisy logs
- Without pipefail, failures in pipelines (e.g., srun|tee, python|tee) won’t stop the script.
- Consider gating
set -x
behind an env flag to keep logs readable.Apply:
set -u set -e -set -x +set -o pipefail +# Enable xtrace only when TRACE=1 +[ "${TRACE:-0}" = "1" ] && set -x
119-121
: Unify Nsys variable naming with start_worker.sh/start_server.shThis script passes
${nsys_on}
to workers, but previous iterations and other scripts often usensys_folder
. Mismatched names will silently disable profiling.
- Confirm what
start_worker.sh
expects (nsys_on vs nsys_folder).- Standardize on one name and create the folder when enabled.
Suggested:
-nsys_on="" -# nsys_on=${full_logdir} # Uncomment this line to enable Nsys profiling +nsys_folder="" +# nsys_folder=${full_logdir}/nsys # Uncomment to enable Nsys profiling +# [ -n "${nsys_folder}" ] && mkdir -p "${nsys_folder}"Then update both worker launches to pass
${nsys_folder}
.To verify across the repo:
#!/bin/bash # Verify argument naming in worker/server scripts rg -nP 'nsys_(on|folder)' examples/disaggregated/slurm/benchmark/start_worker.sh rg -nP 'nsys_(on|folder)' examples/disaggregated/slurm/benchmark/start_server.sh || true
136-159
: YAML generation is piped to tee: without pipefail, failures won’t abortSince this pipeline uses
| tee
, addset -o pipefail
(see earlier comment) so gen failures stop the job.Also, double-check the option names align with
gen_worker_config.py
:
- You pass
--ctx_free_gpu_memory_fraction
but--gen_gpu_memory_fraction
; verify the intended naming asymmetry.To confirm option names:
#!/bin/bash rg -nP 'argparse|add_argument\(' examples/disaggregated/slurm/benchmark/gen_worker_config.py -n -C2
169-178
: Add capacity guard and slice ctx_nodes to the exact required countRight now
ctx_nodes
uses the remainder of all nodes instead of the exact number needed, and there’s no check that requested nodes fit the allocation.Apply:
all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort)) total_nodes_num=${#all_nodes[@]} echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}" # get the node list for the gen workers total_gen_nodes_num=$((gen_nodes_num * num_gen_servers)) gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}}) echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}" # get the node list for the ctx workers total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers)) -ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}}) +if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then + echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2 + exit 1 +fi +ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}}) echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"
181-192
: Critical: GEN worker srun lacks per-group nodelist and overspecifies --ntasks-per-node
- You compute
gen_nodes
but don’t assign per-group slices to each srun, so simultaneous sruns can contend for the same nodes.- Passing
--ntasks-per-node=${ntasks_per_node}
unconditionally can make--ntasks
inconsistent (e.g., gen_tp_size=6, nodes=2 → requires 12 tasks). This will fail or overconstrain placement.Apply:
-# start the gen workers -for i in $(seq 0 $((num_gen_servers - 1))); do - srun -l -N ${gen_nodes_num} \ - --ntasks=${gen_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --container-image=${container_image} \ - --container-name=${container_name} \ - --container-mounts=${mounts} \ - --mpi=pmix \ - bash ${workdir}/start_worker.sh "GEN" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ - &> ${full_logdir}/output_gen_${i}.log & -done +# start the gen workers +for i in $(seq 0 $((num_gen_servers - 1))); do + start=$(( i * gen_nodes_num )) + node_list=("${gen_nodes[@]:${start}:${gen_nodes_num}}") + # Only constrain tasks-per-node when evenly divisible + if (( gen_tp_size % gen_nodes_num == 0 )); then + gen_tasks_per_node=$(( gen_tp_size / gen_nodes_num )) + gen_ntasks_per_node_arg=(--ntasks-per-node="${gen_tasks_per_node}") + else + gen_ntasks_per_node_arg=() + fi + srun -l -N "${gen_nodes_num}" \ + --ntasks="${gen_tp_size}" \ + "${gen_ntasks_per_node_arg[@]}" \ + --container-image="${container_image}" \ + --container-name="${container_name}_gen_${i}" \ + --container-mounts="${mounts}" \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ + --mpi=pmix \ + bash "${workdir}/start_worker.sh" "GEN" "${i}" "${model_dir}" "8336" "${benchmark_mode}" "${concurrency}" "${enable_pdl}" "${full_logdir}" "${nsys_on}" \ + &> "${full_logdir}/output_gen_${i}.log" & +done
195-205
: Critical: CTX worker srun has the same issues as GEN (missing nodelist, overconstrained tasks-per-node)Mirror the GEN fix for CTX.
Apply:
-# start the ctx workers -for i in $(seq 0 $((num_ctx_servers - 1))); do - srun -l -N ${ctx_nodes_num} \ - --ntasks=${ctx_tp_size} \ - --ntasks-per-node=${ntasks_per_node} \ - --container-image=${container_image} \ - --container-name=${container_name} \ - --container-mounts=${mounts} \ - --mpi=pmix \ - bash ${workdir}/start_worker.sh "CTX" ${i} ${model_dir} "8336" ${benchmark_mode} ${concurrency} ${enable_pdl} ${full_logdir} ${nsys_on} \ - &> ${full_logdir}/output_ctx_${i}.log & -done +# start the ctx workers +for i in $(seq 0 $((num_ctx_servers - 1))); do + start=$(( i * ctx_nodes_num )) + node_list=("${ctx_nodes[@]:${start}:${ctx_nodes_num}}") + if (( ctx_tp_size % ctx_nodes_num == 0 )); then + ctx_tasks_per_node=$(( ctx_tp_size / ctx_nodes_num )) + ctx_ntasks_per_node_arg=(--ntasks-per-node="${ctx_tasks_per_node}") + else + ctx_ntasks_per_node_arg=() + fi + srun -l -N "${ctx_nodes_num}" \ + --ntasks="${ctx_tp_size}" \ + "${ctx_ntasks_per_node_arg[@]}" \ + --container-image="${container_image}" \ + --container-name="${container_name}_ctx_${i}" \ + --container-mounts="${mounts}" \ + --nodelist="$(IFS=,; echo "${node_list[*]}")" \ + --mpi=pmix \ + bash "${workdir}/start_worker.sh" "CTX" "${i}" "${model_dir}" "8336" "${benchmark_mode}" "${concurrency}" "${enable_pdl}" "${full_logdir}" "${nsys_on}" \ + &> "${full_logdir}/output_ctx_${i}.log" & +done
🧹 Nitpick comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
49-50
: Deriving per-node capacity from SLURM: verify it matches the intended scheduling unitYou later treat
ntasks_per_node
as the capacity to spread TP across nodes. Ensure this equals your “tasks per node” for each worker type (often GPUs per node). If the cluster exposes GPUs viaSLURM_GPUS_ON_NODE
, prefer that or fall back toSLURM_NTASKS_PER_NODE
.Option:
-ntasks_per_node=${SLURM_NTASKS_PER_NODE:-4} # Default to 4 for GB200 +# Prefer GPUs per node if exported, else tasks-per-node, else 4 +ntasks_per_node=${SLURM_GPUS_ON_NODE:-${SLURM_NTASKS_PER_NODE:-4}}If GEN/CTX use different per-node capacities, you may need separate variables.
162-168
: Node count derivation: ensure the divisor matches per-node capacity and handles edge-cases
ctx_nodes_num
/gen_nodes_num
are computed via ceil-divide byntasks_per_node
. This is fine ifntasks_per_node
is the true per-node capacity for each worker group; otherwise placement will be wrong.If capacities differ per group, derive separately (e.g.,
ctx_tasks_per_node
,gen_tasks_per_node
). Also guard zero/negative inputs if user passes 0 sizes.
216-220
: Nit: pass container-image for benchmark srun for consistencyAll other sruns specify
--container-image
; this one relies on the named container only. Add it for consistency across environments.srun -l --container-name=${container_name} \ - --container-mounts=${mounts} \ + --container-image=${container_image} \ + --container-mounts=${mounts} \ --mpi=pmix --overlap -N 1 -n 1 \
222-222
: Prefer trap-based cleanup to ensure workers/servers are torn down on early failuresIf any stage fails (with
set -e
), the finalscancel
won’t run. A trap guarantees cleanup.+# Ensure cleanup even on early exit +trap 'scancel "${SLURM_JOB_ID}"' EXIT - -scancel ${SLURM_JOB_ID}
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (4)
47-48
: LGTM: optional install path is cleanDefaulting
trtllm_repo
to empty string and guarding the install step with-n
prevents accidental installs. Good.
87-88
: LGTM: normalized container name and structured log directoriesUsing a stable container name and job-scoped log roots keeps artifacts organized.
114-118
: LGTM: commit id discovery is resilientThe fallback to “unknown” avoids
set -e
exits whentrtllm_repo
is unset. Good.
207-214
: LGTM: server launch looks consistentServer start uses a single task on one node and passes
${workdir}
correctly.
Signed-off-by: Kaiyu Xie <[email protected]> Update Signed-off-by: Kaiyu Xie <[email protected]> Update Signed-off-by: Kaiyu Xie <[email protected]> Fix Signed-off-by: Kaiyu Xie <[email protected]> Update Signed-off-by: Kaiyu Xie <[email protected]> Update Signed-off-by: Kaiyu Xie <[email protected]> Update documents Signed-off-by: Kaiyu Xie <[email protected]> Better error handling Signed-off-by: Kaiyu Xie <[email protected]> Fix Signed-off-by: Kaiyu Xie <[email protected]> Fix Signed-off-by: Kaiyu Xie <[email protected]> Fix Signed-off-by: Kaiyu Xie <[email protected]> Update Signed-off-by: Kaiyu Xie <[email protected]> Update Signed-off-by: Kaiyu Xie <[email protected]> Minor fix Signed-off-by: Kaiyu Xie <[email protected]>
Force-pushed from 24453fb to 0eac701.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark.sh (1)
62-76
: Health check ignores HTTP status; check for 200
Current loop exits on any successful TCP connection, even 5xx. Ensure HTTP 200.
-# check server is health by curl every 10 seconds timeout 1800 seconds +# check server health: expect HTTP 200 within timeout timeout=1800 start_time=$(date +%s) -while ! curl -s -o /dev/null -w "%{http_code}" http://${hostname}:${port}/health; do +while true; do + status=$(curl -s -o /dev/null -w "%{http_code}" "http://${hostname}:${port}/health" || echo "000") + if [ "${status}" = "200" ]; then + break + fi current_time=$(date +%s) elapsed=$((current_time - start_time)) if [ $elapsed -ge $timeout ]; then - echo "Error: Server is not healthy after ${timeout} seconds" + echo "Error: Server is not healthy after ${timeout} seconds (last http_code=${status})" exit 1 fi if [ $((elapsed % 30)) -eq 0 ]; then - echo "Waiting for server to be healthy... (${elapsed}s elapsed)" + echo "Waiting for server to be healthy... (${elapsed}s elapsed, http_code=${status})" fi sleep 10 done
♻️ Duplicate comments (17)
examples/disaggregated/slurm/benchmark/gen_server_config.py (4)
1-7
: Add NVIDIA 2025 header and use logging instead of printRequired by repo guidelines; also switch to logging for controllable verbosity.
+#!/usr/bin/env python3 +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import argparse import os import socket import time +import logging import yaml
8-30
: Introduce timeout and log-level flags; initialize logging earlyPrevents indefinite hangs and allows verbosity control.
if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("--num_ctx_servers", type=int, required=True, help="Number of context servers") parser.add_argument("--num_gen_servers", type=int, required=True, help="Number of generation servers") parser.add_argument("--work_dir", type=str, default="logs", help="Work directory") parser.add_argument("--worker_port", type=int, default=8336, help="Worker port") parser.add_argument("--server_port", type=int, default=8333, help="Server port") + parser.add_argument("--wait_timeout_sec", + type=int, + default=600, + help="Max seconds to wait for hostname files before failing") + parser.add_argument("--log_level", + type=str, + default="INFO", + choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], + help="Logging level") args = parser.parse_args() + + logging.basicConfig( + level=getattr(logging, args.log_level.upper(), logging.INFO), + format="%(asctime)s %(levelname)s %(message)s", + ) + logger = logging.getLogger(__name__)
36-49
: Avoid infinite wait; verify CTX/GEN counts separately; fix long line (Ruff E501)Use a deadline and separate CTX/GEN file checks; replace prints with logging.
- #check all of the hostnames in the hostnames folder exists, if not, sleep 10 seconds and check again - hostnames_folder = os.path.join(args.work_dir, "hostnames") - while not os.path.exists(hostnames_folder): - time.sleep(10) - print(f"Waiting for hostnames folder {hostnames_folder} to be found") - hostnames = os.listdir(hostnames_folder) - # check length of hostnames is equal to num_ctx_servers + num_gen_servers, if not, sleep 10 seconds and check again - while len(hostnames) != args.num_ctx_servers + args.num_gen_servers: - time.sleep(10) - hostnames = os.listdir(hostnames_folder) - print( - f"Waiting for hostnames to be found in {hostnames_folder}, current length: {len(hostnames)}, expected length: {args.num_ctx_servers + args.num_gen_servers}" - ) - print(f"All hostnames found in {hostnames_folder}") + hostnames_folder = os.path.join(args.work_dir, "hostnames") + deadline = time.monotonic() + args.wait_timeout_sec + while True: + if os.path.exists(hostnames_folder): + all_files = os.listdir(hostnames_folder) + ctx_files = sorted(f for f in all_files if f.startswith("CTX")) + gen_files = sorted(f for f in all_files if f.startswith("GEN")) + if (len(ctx_files) == args.num_ctx_servers and + len(gen_files) == args.num_gen_servers): + break + logger.info("Waiting for hostnames in %s: CTX=%d/%d, GEN=%d/%d", + hostnames_folder, len(ctx_files), args.num_ctx_servers, + len(gen_files), args.num_gen_servers) + else: + logger.info("Waiting for hostnames folder %s to be created ...", + hostnames_folder) + if time.monotonic() >= deadline: + raise TimeoutError( + f"Timed out after {args.wait_timeout_sec}s waiting for " + f"hostname files in {hostnames_folder}" + ) + time.sleep(10) + logger.info("All required hostname files found in %s", hostnames_folder)
51-64
: Read CTX/GEN files deterministically; dedupe; use explicit encodingAvoids stray files and duplicate entries.
- # get the ctx and gen hostnames from the hostnames file - ctx_hostnames = [] - gen_hostnames = [] - for hostname_file in hostnames: - hostname_file_path = os.path.join(hostnames_folder, hostname_file) - with open(hostname_file_path, 'r') as f: - actual_hostname = f.read().strip() - print(f"Hostname: {actual_hostname} in {hostname_file}") - - if hostname_file.startswith("CTX"): - ctx_hostnames.append(actual_hostname) - elif hostname_file.startswith("GEN"): - gen_hostnames.append(actual_hostname) + # Read hostnames + ctx_hostnames = [] + for hostname_file in ctx_files: + with open(os.path.join(hostnames_folder, hostname_file), "r", encoding="utf-8") as f: + ctx_hostnames.append(f.read().strip()) + gen_hostnames = [] + for hostname_file in gen_files: + with open(os.path.join(hostnames_folder, hostname_file), "r", encoding="utf-8") as f: + gen_hostnames.append(f.read().strip()) + # Deduplicate while preserving order + ctx_hostnames = list(dict.fromkeys(ctx_hostnames)) + gen_hostnames = list(dict.fromkeys(gen_hostnames))examples/disaggregated/slurm/benchmark/gen_worker_config.py (4)
1-4
: Add NVIDIA 2025 header (and optional shebang)Required header is missing.
+#!/usr/bin/env python3 +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import argparse import os
24-24
: Align default cache_transceiver_max_num_tokens with CLI (8448)Prevents surprising behavior for library callers.
- cache_transceiver_max_num_tokens: int = 4608) -> None: + cache_transceiver_max_num_tokens: int = 8448) -> None:
25-49
: Docstring out of sync with signature; replace with accurate Google-style docstringCurrent docstring lists wrong/missing params.
- """ - Generate configuration YAML file for disaggregated inference. - - Args: - config_path: Path to save the config file - model_path: Path to the model - num_ctx_servers: Number of context servers - ctx_tp_size: Tensor parallel size for context servers - ctx_pp_size: Pipeline parallel size for context servers - ctx_batch_size: Batch size for context servers - ctx_max_num_tokens: Max number of tokens for context servers - ctx_max_seq_len: Max sequence length for context servers - ctx_free_gpu_memory_fraction: Free GPU memory fraction for context servers - ctx_enable_attention_dp: Enable attention DP for context servers - num_gen_servers: Number of generation servers - gen_tp_size: Tensor parallel size for generation servers - gen_pp_size: Pipeline parallel size for generation servers - gen_batch_size: Batch size for generation servers - gen_max_num_tokens: Max number of tokens for generation servers - gen_enable_attention_dp: Enable attention DP for generation servers - gen_gpu_memory_fraction: GPU memory fraction for generation servers - eplb_num_slots: Number of slots for eplb - worker_start_port: Start port for workers - server_port: Server port - """ + """ + Generate per-role YAML configs for disaggregated inference workers. + + Writes two files into work_dir: + - ctx_config.yaml (context servers) + - gen_config.yaml (generation servers) + + Args: + work_dir: Directory for output YAML files. + ctx_tp_size: TP size for context servers. + ctx_pp_size: PP size for context servers. + ctx_batch_size: Max batch size for context servers. + ctx_max_num_tokens: Max tokens per batch for context servers. + ctx_max_seq_len: Max sequence length for context servers. + ctx_free_gpu_memory_fraction: Fraction of GPU memory to keep free (CTX). + ctx_enable_attention_dp: Enable attention DP for CTX. + gen_tp_size: TP size for generation servers. + gen_pp_size: PP size for generation servers. + gen_batch_size: Max batch size for generation servers. + gen_max_num_tokens: Max tokens per batch for generation servers. + gen_max_seq_len: Max sequence length for generation servers. + gen_enable_attention_dp: Enable attention DP for GEN. + gen_gpu_memory_fraction: Fraction of GPU memory to keep free (GEN). + eplb_num_slots: Expert-parallel load balancer slots (0 disables). + mtp_size: If >0, enable MTP speculative decoding with this many layers. + cache_transceiver_max_num_tokens: Max tokens buffered by transceiver. + """
136-137
: Ensure work_dir exists before writing filesPrevents FileNotFoundError.
- ctx_config_file = os.path.join(work_dir, "ctx_config.yaml") - gen_config_file = os.path.join(work_dir, "gen_config.yaml") + os.makedirs(work_dir, exist_ok=True) + ctx_config_path = os.path.join(work_dir, "ctx_config.yaml") + gen_config_path = os.path.join(work_dir, "gen_config.yaml")examples/disaggregated/slurm/benchmark/start_worker.sh (6)
6-14
: Validate arguments early with a usage messageAvoids undefined behavior when too few args are passed.
+if [ "$#" -lt 9 ]; then + echo "Usage: $0 <role:CTX|GEN> <instance_id> <model_path> <port> <benchmark_mode:e2e|gen_only> <concurrency> <enable_pdl:true|false> <work_dir> <nsys_folder|empty>" >&2 + exit 2 +fi role=$1 instance_id=$2 model_path=$3
1-5
: Fix shebang and enable strict mode (+pipefail, safe IFS)Improves robustness and portability.
-#! /bin/bash -set -u -set -e -set -x +#!/usr/bin/env bash +set -euo pipefail +set -x +IFS=$'\n\t'
44-48
: Guard SLURM_NODEID and quote expansionsPrevents unbound var errors and word splitting.
-if [ "${SLURM_NODEID}" = "0" ]; then - mkdir -p ${work_dir}/hostnames/ - echo $(hostname) > ${work_dir}/hostnames/${role}_${instance_id}.txt - echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt" +if [ "${SLURM_NODEID:-}" = "0" ]; then + mkdir -p "${work_dir}/hostnames/" + echo "$(hostname)" > "${work_dir}/hostnames/${role}_${instance_id}.txt" + echo "hostname saved to ${work_dir}/hostnames/${role}_${instance_id}.txt" fi
51-54
: Quote command arguments in the non-NSYS pathAvoids word splitting and globbing.
- trtllm-llmapi-launch trtllm-serve ${model_path} --host $(hostname) --port ${port} --extra_llm_api_options ${config_file} + trtllm-llmapi-launch trtllm-serve "${model_path}" --host "$(hostname)" --port "${port}" --extra_llm_api_options "${config_file}"
55-70
: Build NSYS prefix as an array; quote everythingFixes SC2089/SC2090 and makes quoting robust.
- nsys_prefix="" - nsys_file=${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID} + nsys_args=() + nsys_file="${nsys_folder}/nsys_worker_proc_${instance_id}_${SLURM_PROCID}" @@ - nsys_prefix="nsys profile -e \"NSYS_MPI_STORE_TEAMS_PER_RANK=1\" -o ${nsys_file} -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none" - echo "nsys_prefix: ${nsys_prefix}" + nsys_args=(nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o "${nsys_file}" -f true -t cuda,nvtx,python-gil -c cudaProfilerApi --cuda-graph-trace node --capture-range-end=stop --gpu-metrics-devices=none) + echo "nsys_args: ${nsys_args[*]}" @@ - trtllm-llmapi-launch ${nsys_prefix} \ - trtllm-serve ${model_path} \ - --host $(hostname) --port ${port} \ - --extra_llm_api_options ${config_file} + trtllm-llmapi-launch "${nsys_args[@]}" \ + trtllm-serve "${model_path}" \ + --host "$(hostname)" --port "${port}" \ + --extra_llm_api_options "${config_file}"
31-40
: Check config file exists after derivationFail fast if the YAML is missing.
echo "config_file: ${config_file}" +if [ ! -f "${config_file}" ]; then + echo "Config file not found: ${config_file}" >&2 + exit 1 +fiexamples/disaggregated/slurm/benchmark/disaggr_torch.slurm (2)
10-12
: Enable pipefailPrevents pipeline failures from being masked (e.g., with tee).
set -u set -e +set -o pipefail set -x
171-183
: Bound ctx_nodes slice and add capacity checkAvoids over-slicing and undefined behavior when requested nodes exceed allocation.
all_nodes=($(scontrol show hostname $SLURM_NODELIST | sort)) total_nodes_num=${#all_nodes[@]} echo "all_nodes: ${all_nodes[@]}, total_nodes_num: ${total_nodes_num}" # get the node list for the gen workers total_gen_nodes_num=$((gen_nodes_num * num_gen_servers)) gen_nodes=(${all_nodes[@]:0:${total_gen_nodes_num}}) echo "gen_nodes: ${gen_nodes[@]}, total_gen_nodes_num: ${total_gen_nodes_num}" # get the node list for the ctx workers total_ctx_nodes_num=$((ctx_nodes_num * num_ctx_servers)) -ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_nodes_num}}) +if (( total_gen_nodes_num + total_ctx_nodes_num > total_nodes_num )); then + echo "Error: Requested nodes (gen:${total_gen_nodes_num} + ctx:${total_ctx_nodes_num}) exceed allocated nodes (${total_nodes_num})." >&2 + exit 1 +fi +ctx_nodes=(${all_nodes[@]:${total_gen_nodes_num}:${total_ctx_nodes_num}}) echo "ctx_nodes: ${ctx_nodes[@]}, total_ctx_nodes_num: ${total_ctx_nodes_num}"examples/disaggregated/slurm/benchmark/README.md (1)
126-129
: Fix startup order (server before workers) or document worker retry semantics.Current steps launch workers before the server, which can fail without retry/wait logic. Recommend flipping steps 6 and 7 for safer default. If you keep workers-first, please document and implement worker-side retries.
-5. `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files. -6. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers for both context and generation phases. -7. `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration using `gen_server_config.py`. +5. `disaggr_torch.slurm` runs `gen_worker_config.py` to create worker configuration files. +6. `disaggr_torch.slurm` starts the main `trtllm-serve` process using `start_server.sh`, which generates the server configuration via `gen_server_config.py`. +7. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers for both context and generation phases.If you prefer workers-first, add a note after step 7: “Workers retry until the server is reachable (e.g., polling /health).”
🧹 Nitpick comments (12)
examples/disaggregated/slurm/benchmark/gen_server_config.py (3)
32-35
: Exit cleanly on missing work_dirRaise SystemExit for a CLI tool; current ValueError produces a traceback.
- if not os.path.exists(args.work_dir): - raise ValueError(f"Work directory {args.work_dir} not found") + if not os.path.exists(args.work_dir): + raise SystemExit(f"Work directory {args.work_dir} not found")
65-71
: Replace prints with logging for hostname and current hostKeeps logs consistent.
- print(f"ctx_hostnames: {ctx_hostnames}") - print(f"gen_hostnames: {gen_hostnames}") + logger.debug("ctx_hostnames: %s", ctx_hostnames) + logger.debug("gen_hostnames: %s", gen_hostnames) @@ - print(f"Current hostname: {hostname}") + logger.info("Current hostname: %s", hostname)
86-90
: Use UTF-8 encoding and quiet YAML formattingEncoding avoids locale issues; sort_keys=False preserves order; replace print with logging.
- with open(os.path.join(args.work_dir, "server_config.yaml"), "w") as f: - yaml.dump(server_config, f) - print( - f"Server config file {os.path.join(args.work_dir, 'server_config.yaml')} generated" - ) + out_path = os.path.join(args.work_dir, "server_config.yaml") + with open(out_path, "w", encoding="utf-8") as f: + yaml.dump(server_config, f, default_flow_style=False, sort_keys=False) + logger.info("Server config file generated at %s", out_path)examples/disaggregated/slurm/benchmark/gen_worker_config.py (1)
119-124
: Open YAML with UTF-8 and consistent dumper optionsEncoding prevents locale issues.
- with open(moe_load_balancer_file, "w") as f: - yaml.dump(moe_load_balancer_config, - f, - default_flow_style=False, - sort_keys=False) + with open(moe_load_balancer_file, "w", encoding="utf-8") as f: + yaml.dump(moe_load_balancer_config, f, default_flow_style=False, sort_keys=False)examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (1)
187-211
: Consider constraining srun to explicit node listsYou compute gen_nodes/ctx_nodes but do not use them; concurrent sruns may contend on the same nodes. Add --nodelist per group if segregation is required.
Would you like me to propose a per-loop nodelist assignment (partition gen_nodes/ctx_nodes slices per server index)?
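For illustration, such an assignment could look like the sketch below, assuming `gen_nodes` holds the hostnames computed earlier in the script; the per-index slicing is a suggestion, not current behavior, and container image/name/mount flags are omitted for brevity.

```bash
# Illustrative only: slice gen_nodes per server index and pin each srun
# to its own slice via --nodelist (mirror the same pattern for ctx_nodes).
for i in $(seq 0 $((num_gen_servers - 1))); do
    start=$(( i * gen_nodes_num ))
    node_list=("${gen_nodes[@]:${start}:${gen_nodes_num}}")
    srun -l -N "${gen_nodes_num}" \
        --ntasks="${gen_tp_size}" \
        --nodelist="$(IFS=,; echo "${node_list[*]}")" \
        --mpi=pmix \
        bash "${workdir}/start_worker.sh" "GEN" "${i}" "${model_dir}" "8336" \
            "${benchmark_mode}" "${concurrency}" "${enable_pdl}" "${full_logdir}" "${nsys_on}" \
        &> "${full_logdir}/output_gen_${i}.log" &
done
```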
examples/disaggregated/slurm/benchmark/run_benchmark.sh (4)
28-44
: Quote config path in wait loopAvoids word-splitting/globbing edge cases.
-while [ ! -f ${config_file} ]; do +while [ ! -f "${config_file}" ]; do
46-54
: Anchor YAML field extraction and quote variablesReduces false matches and parsing errors.
-hostname=$(grep -i "hostname:" ${config_file} | awk '{print $2}') -port=$(grep -i "port:" ${config_file} | awk '{print $2}') +hostname=$(grep -iE '^\s*hostname:\s*' "${config_file}" | awk '{print $2}') +port=$(grep -iE '^\s*port:\s*' "${config_file}" | awk '{print $2}') if [ -z "$hostname" ] || [ -z "$port" ]; then
80-97
: Quote glob loops and outputs in do_get_logsPrevents word splitting and handles no-match safely.
- for gen_file in ${log_path}/output_gen_*.log; do - if [ -f "$gen_file" ]; then + shopt -s nullglob + for gen_file in "${log_path}"/output_gen_*.log; do + if [ -f "${gen_file}" ]; then index=$(basename "$gen_file" | sed 's/output_gen_\(.*\)\.log/\1/') - grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "$gen_file" > "${output_folder}/gen_only_${index}.txt" || true + grep -a "'num_ctx_requests': 0, 'num_ctx_tokens': 0" "${gen_file}" > "${output_folder}/gen_only_${index}.txt" || true fi done - for ctx_file in ${log_path}/output_ctx_*.log; do - if [ -f "$ctx_file" ]; then + for ctx_file in "${log_path}"/output_ctx_*.log; do + if [ -f "${ctx_file}" ]; then index=$(basename "$ctx_file" | sed 's/output_ctx_\(.*\)\.log/\1/') - grep -a "'num_generation_tokens': 0" "$ctx_file" > "${output_folder}/ctx_only_${index}.txt" || true + grep -a "'num_generation_tokens': 0" "${ctx_file}" > "${output_folder}/ctx_only_${index}.txt" || true fi done
99-123
: Quote paths and args during runs and log aggregationMinor robustness improvements.
- mkdir -p ${log_path}/concurrency_${concurrency} + mkdir -p "${log_path}/concurrency_${concurrency}" @@ - --dataset-path ${shared_gpt_path} \ + --dataset-path "${shared_gpt_path}" \ @@ - do_get_logs ${log_path} ${log_path}/concurrency_${concurrency} - echo "done for ${concurrency} in folder ${log_path}/concurrency_${concurrency}" + do_get_logs "${log_path}" "${log_path}/concurrency_${concurrency}" + echo "done for ${concurrency} in folder ${log_path}/concurrency_${concurrency}"examples/disaggregated/slurm/benchmark/README.md (3)
10-10
: Align description with server-first startup order.Clarify the sequence explicitly to avoid race conditions.
-2. `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates configuration files, starts the server and workers, and runs the benchmark client. +2. `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates configuration files, starts the server (via `start_server.sh`), then launches workers (via `start_worker.sh`), and runs the benchmark client.
14-14
: Grammar nit: add article.-6. `start_server.sh`: A shell script responsible for starting disaggregated server using `trtllm-serve` on each allocated machine. +6. `start_server.sh`: A shell script responsible for starting the disaggregated server using `trtllm-serve` on each allocated machine.
63-70
: Call out deprecation/replacement to prevent confusion.Explicitly note this replaces the previous single-config generator to guide users migrating from older docs/scripts.
This Python script generates the worker configuration YAML file that configures the `trtllm-serve` workers. It creates separate configurations for context and generation workers with different tensor parallelism, batch sizes, and other parameters. +It replaces the previous `gen_yaml.py`-based workflow.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- examples/disaggregated/slurm/benchmark/README.md (3 hunks)
- examples/disaggregated/slurm/benchmark/disaggr_torch.slurm (6 hunks)
- examples/disaggregated/slurm/benchmark/gen_server_config.py (1 hunks)
- examples/disaggregated/slurm/benchmark/gen_worker_config.py (1 hunks)
- examples/disaggregated/slurm/benchmark/gen_yaml.py (0 hunks)
- examples/disaggregated/slurm/benchmark/run_benchmark.sh (4 hunks)
- examples/disaggregated/slurm/benchmark/start_server.sh (1 hunks)
- examples/disaggregated/slurm/benchmark/start_worker.sh (2 hunks)
💤 Files with no reviewable changes (1)
- examples/disaggregated/slurm/benchmark/gen_yaml.py
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/disaggregated/slurm/benchmark/start_server.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in `__init__`
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
examples/disaggregated/slurm/benchmark/gen_server_config.py
examples/disaggregated/slurm/benchmark/gen_worker_config.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
examples/disaggregated/slurm/benchmark/gen_server_config.py
examples/disaggregated/slurm/benchmark/gen_worker_config.py
🧠 Learnings (1)
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Applied to files:
examples/disaggregated/slurm/benchmark/gen_worker_config.py
🪛 Ruff (0.12.2)
examples/disaggregated/slurm/benchmark/gen_server_config.py
47-47: Line too long (168 > 120)
(E501)
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/benchmark/start_worker.sh
[warning] 46-46: Quote this to prevent word splitting.
(SC2046)
[warning] 53-53: Quote this to prevent word splitting.
(SC2046)
[warning] 61-61: Quotes/backslashes will be treated literally. Use an array.
(SC2089)
[warning] 66-66: Quotes/backslashes in this variable will not be respected.
(SC2090)
[warning] 68-68: Quote this to prevent word splitting.
(SC2046)
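These warning classes usually come down to two patterns; below is a generic sketch of the typical fixes (the names are invented for illustration and are not taken from `start_worker.sh`):

```bash
#!/bin/bash
# Illustrative sketch only.

# SC2046: quote command substitutions so their output is not word-split.
script_dir=$(dirname "$(readlink -f "$0")")
echo "script dir: ${script_dir}"

# SC2089/SC2090: a command that carries embedded quotes should be kept in an
# array rather than a string; "${cmd[@]}" then passes each element through
# as exactly one argument.
cmd=(printf '%s\n' "argument with spaces" "--flag=value with spaces")
"${cmd[@]}"
```

Keeping the command in an array avoids re-parsing quotes when the variable is later expanded, which is what SC2090 warns about.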
🪛 LanguageTool
examples/disaggregated/slurm/benchmark/README.md
[grammar] ~14-~14: There might be a mistake here.
Context: ...ver.sh: A shell script responsible for starting disaggregated server using trtllm-serv... (QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...trtllm-serve on each allocated machine. 7. run_benchmark.sh: A shell script that waits for the serv... (QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Kaiyu Xie <[email protected]>
/bot skip --comment "the example slurm scripts are not protected by CI pipeline"
PR_Github #16676 [ skip ] triggered by Bot
PR_Github #16676 [ skip ] completed with state
Summary by CodeRabbit
New Features
Improvements
Refactor
Documentation
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill
Kill all running builds associated with pull request.

skip

skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
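For illustration, a typical invocation combining a few of the `run` options documented above might look like the following (the stage name is just the placeholder used in the help text):

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
```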