Conversation

kaiyux
Member

@kaiyux kaiyux commented Jul 31, 2025

Summary by CodeRabbit

  • New Features

    • Expanded configuration options for sequence length limits, GPU memory allocation, and cache token limits in server setup.
    • Enhanced configurability for backend selection and allreduce strategy based on server parameters.
    • Added support for multiple SLURM job submissions with varied slot configurations and enhanced job submission parameters.
  • Improvements

    • More detailed log directory naming for easier tracking of server and benchmarking parameters.
    • Scripts now print additional information about concurrency and log directories for improved clarity.
    • Extended runtime duration for server processes.
    • Enhanced SLURM job submission scripts with additional options and improved parameterization.
  • Bug Fixes

    • Improved hostname detection logic in server startup scripts for broader compatibility.
  • Chores

    • Updated script argument handling and removed unused environment variables for cleaner execution.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
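For instance, several of the options above can be combined in a single comment. An illustrative invocation (the GPU types and backend names are copied from the examples in this help text, not from a real pipeline):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp"
```

As documented above, note that --gpu-type and --test-backend both leave the GitHub check/pipeline status un-updated.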

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Contributor

coderabbitai bot commented Jul 31, 2025

📝 Walkthrough

The changes update and expand parameterization, configuration, and script logic for running disaggregated server workloads in a SLURM environment. This includes restructuring argument handling, increasing configurability for sequence lengths and memory, refining YAML generation, updating how hostnames are matched, and removing concurrency from worker scripts.

Changes

Cohort / File(s) — Change Summary

  • SLURM Example Script Enhancements
    examples/disaggregated/slurm/disaggr_torch.slurm
    Argument handling was restructured and expanded from 13 to 22 parameters, grouping them by server type and function. New variables for max sequence lengths, GPU memory fraction, and cache token limits were added. The container name and log directory naming scheme were updated. The script now prints additional info, updates PDL logic, and modifies worker and YAML generation commands. The process-killing command was removed, and echo statements were added.
  • YAML Generation Configurability
    examples/disaggregated/slurm/gen_yaml.py
    The YAML config generator function and CLI were extended to accept and process new parameters for max sequence lengths, GPU memory fractions, and cache transceiver buffer size. Backend and allreduce strategy logic were updated to be conditional on tensor parallel size and attention DP. The function signature and CLI interface were both updated to reflect these changes.
  • Server Start Script Logic
    examples/disaggregated/slurm/start_server.sh
    Hostname extraction logic was broadened to match any line containing "hostname:". The runtime parameter for trtllm-serve was increased from 1800 to 7200 seconds.
  • Worker Start Script Parameter Simplification
    examples/disaggregated/slurm/start_worker.sh
    The concurrency parameter was removed, reducing the argument count from five to four. Associated environment variable exports were deleted. Argument handling, echo statements, and a string-to-numeric comparison in a conditional were updated accordingly.
  • SLURM Job Submission Script Enhancements
    examples/disaggregated/slurm/submit.sh, examples/wide_ep/slurm_scripts/submit.sh
    Both submission scripts were enhanced with new variables for flags and paths (e.g., isl, osl, multi_round, streaming, container_image, mounts, workdir, model_dir). SLURM sbatch commands were extended with additional options (--partition, --account, --job-name, --gres=gpu:${ntasks_per_node}) and expanded argument lists passed to the SLURM workload script. The wide_ep script added nested loops to submit multiple jobs with varying eplb_num_slots. Echo reminders about SLURM parameters were removed.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SLURM Script
    participant gen_yaml.py
    participant start_server.sh
    participant start_worker.sh

    User->>SLURM Script: Submit job with expanded parameters
    SLURM Script->>gen_yaml.py: Generate YAML config with new sequence/memory params
    SLURM Script->>start_server.sh: Start server with updated hostname logic and runtime
    SLURM Script->>start_worker.sh: Start workers (without concurrency param)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • Shixiaowei02
  • pcastonguay
  • yizhang-nv


@kaiyux kaiyux marked this pull request as draft July 31, 2025 15:53
kaiyux added 2 commits July 31, 2025 09:15
Signed-off-by: Kaiyu Xie <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
@kaiyux kaiyux marked this pull request as ready for review July 31, 2025 16:15
@kaiyux kaiyux force-pushed the user/kaiyu/update_slurm_scripts branch from 0ae9c77 to 301bde6 Compare July 31, 2025 16:15
@kaiyux kaiyux requested review from qiaoxj07 and pcastonguay July 31, 2025 16:16
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
examples/disaggregated/slurm/gen_yaml.py (1)

230-230: Consider making stream_interval configurable.

The stream_interval is hardcoded to 20. Consider making this a configurable parameter for consistency with other settings.

+    stream_interval: int = 20,

And in the function call:

-            'stream_interval': 20,
+            'stream_interval': stream_interval,
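Applied together, the two hunks above amount to threading a parameter through the generator. A trimmed-down sketch (the real gen_config_file takes many more parameters; only the stream_interval plumbing is shown here):

```python
def gen_config_file(config_path: str, stream_interval: int = 20) -> dict:
    """Trimmed-down sketch: only the stream_interval plumbing is shown."""
    return {
        # Previously hardcoded to 20; a keyword argument with the old value
        # as its default keeps existing call sites unchanged.
        'stream_interval': stream_interval,
    }

# Default preserves the old behavior; callers can now override it.
print(gen_config_file("config.yaml")['stream_interval'])
print(gen_config_file("config.yaml", stream_interval=5)['stream_interval'])
```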
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae9c77 and 301bde6.

📒 Files selected for processing (6)
  • examples/disaggregated/slurm/disaggr_torch.slurm (4 hunks)
  • examples/disaggregated/slurm/gen_yaml.py (9 hunks)
  • examples/disaggregated/slurm/start_server.sh (2 hunks)
  • examples/disaggregated/slurm/start_worker.sh (2 hunks)
  • examples/disaggregated/slurm/submit.sh (1 hunks)
  • examples/wide_ep/slurm_scripts/submit.sh (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/disaggregated/slurm/start_server.sh
  • examples/disaggregated/slurm/start_worker.sh
  • examples/disaggregated/slurm/disaggr_torch.slurm
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/submit.sh

[error] 13-13: Couldn't parse this variable assignment. Fix to allow more checks.

(SC1073)


[error] 13-13: Fix any mentioned problems and try again.

(SC1072)

examples/wide_ep/slurm_scripts/submit.sh

[error] 12-12: Couldn't parse this variable assignment. Fix to allow more checks.

(SC1073)


[error] 12-12: Fix any mentioned problems and try again.

(SC1072)

🔇 Additional comments (10)
examples/disaggregated/slurm/gen_yaml.py (5)

128-129: LGTM! Enhanced parameterization improves configurability.

The addition of configurable sequence length parameters (ctx_max_seq_len, gen_max_seq_len) and cache transceiver buffer size (cache_transceiver_max_num_tokens) replaces hardcoded values, making the configuration more flexible.

Also applies to: 135-135, 142-142


170-175: Verify the backend selection logic conditions.

The backend selection logic has specific conditions for gen_moe_backend:

  • Line 171-172: Uses "WIDEEP" when gen_tp_size >= 16 AND gen_enable_attention_dp is true
  • Line 173-174: Uses "TRTLLM" when gen_enable_attention_dp is false (regardless of tp_size)

This means for gen_tp_size >= 16 with attention DP disabled, "TRTLLM" takes precedence over "WIDEEP". Please confirm this logic aligns with the intended backend selection strategy.
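The precedence being questioned can be reproduced in isolation. A hypothetical reconstruction of the conditional order (the "CUTLASS" fallback is an assumption for the remaining cases; the diff only shows the two branches discussed):

```python
def select_gen_moe_backend(gen_tp_size: int, gen_enable_attention_dp: bool) -> str:
    """Sketch of the gen_moe_backend selection order discussed in this comment."""
    if gen_tp_size >= 16 and gen_enable_attention_dp:
        return "WIDEEP"   # large TP with attention DP enabled
    if not gen_enable_attention_dp:
        return "TRTLLM"   # attention DP disabled, regardless of TP size
    return "CUTLASS"      # assumed default for the remaining cases

# The case the reviewer highlights: TP >= 16 but attention DP disabled
# falls through the first branch and lands on TRTLLM.
print(select_gen_moe_backend(16, False))
```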


185-186: LGTM! Consistent use of configurable memory fractions.

The changes properly replace hardcoded memory fraction values with the new configurable parameters in both context and generation server configurations, including their respective kv_cache_config sections.

Also applies to: 195-196, 211-212, 219-219, 227-227


254-256: LGTM! Appropriate allreduce strategy configuration.

The conditional addition of allreduce_strategy for specific tensor parallel configurations (gen_tp_size == 8 and not gen_enable_attention_dp) is well-targeted and follows the pattern of conditional configuration.


311-318: LGTM! Comprehensive argument parsing for new parameters.

The command-line interface properly adds all new required parameters with appropriate help text and passes them correctly to the gen_config_file function.

Also applies to: 339-342, 367-370, 376-383

examples/disaggregated/slurm/submit.sh (2)

5-6: Variable reordering and new configuration variables look good.

The repositioning of ntasks_per_node, ctx_num, and concurrency variables, along with the addition of new configuration variables (isl, osl, multi_round, streaming, etc.) enhances the script's configurability. However, ensure all placeholder values are properly set before use.

Also applies to: 7-16


29-41: LGTM! Well-organized argument structure.

The reorganized argument structure with clear groupings (Context servers, Generation servers, Other arguments, Benchmarking arguments, User specific arguments) makes the script more maintainable and readable.

examples/wide_ep/slurm_scripts/submit.sh (3)

8-15: LGTM! New configuration variables enhance flexibility.

The addition of configuration variables (isl, osl, multi_round, streaming, etc.) provides better parameterization for the SLURM jobs.


18-46: Well-structured nested loop for batch size and slot configurations.

The nested loops provide comprehensive testing across different batch sizes (1, 64, 1024) and eplb_num_slots values (0, 256, 288). The dynamic calculation of concurrency, ctx_num, and total_node_num based on batch size is appropriate.
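The loop structure praised here can be sketched as follows. The arithmetic for concurrency, ctx_num, and total_node_num is entirely hypothetical — the actual formulas live in examples/wide_ep/slurm_scripts/submit.sh and are not shown in this review — and sbatch is replaced with echo so the sketch is side-effect free:

```shell
#!/bin/bash
# Illustrative nested submission loop over batch sizes and eplb_num_slots.
# All three derived quantities below use assumed formulas for demonstration.
for b in 1 64 1024; do
    for eplb_num_slots in 0 256 288; do
        concurrency=$((b * 8))                    # assumed scaling with batch size
        ctx_num=$(((concurrency + 5499) / 5500))  # assumed per-ctx-server capacity, ceil-divided
        total_node_num=$((ctx_num + 8))           # assumed: ctx nodes plus fixed gen nodes
        echo "sbatch ... ctx_num=${ctx_num} eplb_num_slots=${eplb_num_slots} --nodes=${total_node_num}"
    done
done
```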


49-75: LGTM! Consistent pattern for dep32 configuration.

The second loop for batch size 512 with dep32 configuration follows the same well-structured pattern as the first loop, maintaining consistency in the script design.

@kaiyux kaiyux changed the title chore: Update SLURM scripts chore: Make example SLURM scripts more parameterized Jul 31, 2025
Signed-off-by: Kaiyu Xie <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
examples/disaggregated/slurm/submit.sh (1)

42-46: Repeated: validate required SLURM variables before sbatch
Same concern as in the previous review – if ${partition}, ${account} or ${job_name} are still empty the job submission will fail at runtime. Add explicit checks ([[ -z … ]]) or abort early.
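A minimal sketch of the validation this comment asks for (the variable names come from the review; the placeholder values and error-message wording are ours):

```shell
#!/bin/bash
# Fail fast on unset variables, non-zero exits, and pipeline errors.
set -euo pipefail

partition="debug"        # placeholder values for the sketch
account="my_account"
job_name="disaggr_test"

# Abort before calling sbatch if any required SLURM variable is empty.
for var in partition account job_name; do
    if [ -z "${!var:-}" ]; then
        echo "Error: \$${var} must be set before submitting the job" >&2
        exit 1
    fi
done
echo "All required SLURM variables are set"
```

Checking up front turns a confusing sbatch failure at runtime into an immediate, named error.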

examples/wide_ep/slurm_scripts/submit.sh (2)

5-11: Shell syntax error – placeholders must be quoted or replaced
Same parsing issue as the other script; quote or default-initialize these variables.


80-88: Strict mode & validation still missing
Add set -euo pipefail at top and pre-flight checks for ${partition}, ${account}, ${job_name} – see earlier suggestion.

🧹 Nitpick comments (3)
examples/disaggregated/slurm/submit.sh (2)

21-34: Quote expansions inside the args array to preserve spaces/colons
$mounts, $workdir, $model_dir can all contain path separators or spaces. Unquoted expansions inside an array will word-split. Safer:

-    $container_image
-    $mounts
-    $workdir
-    $model_dir
+    "${container_image}"
+    "${mounts}"
+    "${workdir}"
+    "${model_dir}"

1-2: Enable strict mode for safer scripting
Consider set -euo pipefail right after the shebang to fail fast on unset variables, non-zero exits and pipeline errors.

examples/wide_ep/slurm_scripts/submit.sh (1)

29-42: Quote variable expansions in args array
Same word-splitting risk as in the other script – wrap ${…} in double quotes.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 301bde6 and f62d449.

📒 Files selected for processing (2)
  • examples/disaggregated/slurm/submit.sh (1 hunks)
  • examples/wide_ep/slurm_scripts/submit.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/wide_ep/slurm_scripts/submit.sh

[error] 5-5: Couldn't parse this variable assignment. Fix to allow more checks.

(SC1073)


[error] 5-5: Fix any mentioned problems and try again.

(SC1072)

examples/disaggregated/slurm/submit.sh

[error] 3-3: Couldn't parse this variable assignment. Fix to allow more checks.

(SC1073)


[error] 3-3: Fix any mentioned problems and try again.

(SC1072)

@kaiyux
Member Author

kaiyux commented Aug 1, 2025

/bot skip --comment "the scripts are currently not protected by CI pipeline"

@tensorrt-cicd
Collaborator

PR_Github #13759 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13759 [ skip ] completed with state SUCCESS
Skipping testing for commit f62d449

@kaiyux kaiyux merged commit aee35e2 into NVIDIA:main Aug 1, 2025
3 checks passed
@kaiyux kaiyux deleted the user/kaiyu/update_slurm_scripts branch August 1, 2025 04:53
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025
jain-ria pushed a commit to jain-ria/TensorRT-LLM that referenced this pull request Aug 7, 2025