chore: Make example SLURM scripts more parameterized #6511
Conversation
📝 Walkthrough

The changes update and expand parameterization, configuration, and script logic for running disaggregated server workloads in a SLURM environment. This includes restructuring argument handling, increasing configurability for sequence lengths and memory, refining YAML generation, updating how hostnames are matched, and removing concurrency from worker scripts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant SLURM Script
    participant gen_yaml.py
    participant start_server.sh
    participant start_worker.sh
    User->>SLURM Script: Submit job with expanded parameters
    SLURM Script->>gen_yaml.py: Generate YAML config with new sequence/memory params
    SLURM Script->>start_server.sh: Start server with updated hostname logic and runtime
    SLURM Script->>start_worker.sh: Start workers (without concurrency param)
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Signed-off-by: Kaiyu Xie <[email protected]>
`0ae9c77` to `301bde6` (compare)
Actionable comments posted: 4
🧹 Nitpick comments (1)
examples/disaggregated/slurm/gen_yaml.py (1)
`230-230`: **Consider making `stream_interval` configurable.** The `stream_interval` is hardcoded to 20. Consider making this a configurable parameter for consistency with other settings.

```diff
+    stream_interval: int = 20,
```

And in the function call:

```diff
-    'stream_interval': 20,
+    'stream_interval': stream_interval,
```
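The suggestion amounts to threading one more keyword argument through config generation, the same way the other new parameters were added. A minimal sketch (function and dictionary shape simplified; not the actual `gen_yaml.py` code):

```python
# Sketch of the suggested change: thread stream_interval through as a keyword
# argument instead of hardcoding the literal 20. Names are simplified from
# gen_yaml.py for illustration.
def gen_server_config(stream_interval: int = 20) -> dict:
    # The value now comes from the parameter rather than a hardcoded literal.
    return {
        'stream_interval': stream_interval,
    }

config = gen_server_config()                     # default keeps old behavior
tuned = gen_server_config(stream_interval=50)    # now overridable per run
print(config['stream_interval'], tuned['stream_interval'])
```

The default of 20 preserves the current behavior for callers that do not pass the new argument.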
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- examples/disaggregated/slurm/disaggr_torch.slurm (4 hunks)
- examples/disaggregated/slurm/gen_yaml.py (9 hunks)
- examples/disaggregated/slurm/start_server.sh (2 hunks)
- examples/disaggregated/slurm/start_worker.sh (2 hunks)
- examples/disaggregated/slurm/submit.sh (1 hunks)
- examples/wide_ep/slurm_scripts/submit.sh (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/disaggregated/slurm/start_server.sh
- examples/disaggregated/slurm/start_worker.sh
- examples/disaggregated/slurm/disaggr_torch.slurm
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/disaggregated/slurm/submit.sh
[error] 13-13: Couldn't parse this variable assignment. Fix to allow more checks.
(SC1073)
[error] 13-13: Fix any mentioned problems and try again.
(SC1072)
examples/wide_ep/slurm_scripts/submit.sh
[error] 12-12: Couldn't parse this variable assignment. Fix to allow more checks.
(SC1073)
[error] 12-12: Fix any mentioned problems and try again.
(SC1072)
🔇 Additional comments (10)
examples/disaggregated/slurm/gen_yaml.py (5)
`128-129`: **LGTM! Enhanced parameterization improves configurability.** The addition of configurable sequence length parameters (`ctx_max_seq_len`, `gen_max_seq_len`) and cache transceiver buffer size (`cache_transceiver_max_num_tokens`) replaces hardcoded values, making the configuration more flexible.

Also applies to: 135-135, 142-142
`170-175`: **Verify the backend selection logic conditions.** The backend selection logic has specific conditions for `gen_moe_backend`:

- Lines 171-172: uses "WIDEEP" when `gen_tp_size >= 16` AND `gen_enable_attention_dp` is true
- Lines 173-174: uses "TRTLLM" when `gen_enable_attention_dp` is false (regardless of tp_size)

This means for `gen_tp_size >= 16` with attention DP disabled, "TRTLLM" takes precedence over "WIDEEP". Please confirm this logic aligns with the intended backend selection strategy.
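The precedence being questioned can be restated as a small decision function. This is a sketch of the conditions as described above, not the actual `gen_yaml.py` code; the fallthrough default is an assumption for illustration:

```python
# Sketch of the backend-selection conditions described in the review.
# The "CUTLASS" fallthrough is an assumed default, not taken from the PR.
def select_moe_backend(gen_tp_size: int, gen_enable_attention_dp: bool) -> str:
    if gen_tp_size >= 16 and gen_enable_attention_dp:
        return "WIDEEP"
    if not gen_enable_attention_dp:
        return "TRTLLM"
    return "CUTLASS"  # assumed default for the remaining cases

# The case the review flags: large TP with attention DP disabled still
# falls through to the second branch, so TRTLLM wins over WIDEEP.
print(select_moe_backend(32, False))
```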
`185-186`: **LGTM! Consistent use of configurable memory fractions.** The changes properly replace hardcoded memory fraction values with the new configurable parameters in both context and generation server configurations, including their respective `kv_cache_config` sections.

Also applies to: 195-196, 211-212, 219-219, 227-227
`254-256`: **LGTM! Appropriate allreduce strategy configuration.** The conditional addition of `allreduce_strategy` for specific tensor parallel configurations (`gen_tp_size == 8` and `not gen_enable_attention_dp`) is well-targeted and follows the pattern of conditional configuration.
`311-318`: **LGTM! Comprehensive argument parsing for new parameters.** The command-line interface properly adds all new required parameters with appropriate help text and passes them correctly to the `gen_config_file` function.

Also applies to: 339-342, 367-370, 376-383
examples/disaggregated/slurm/submit.sh (2)
`5-6`: **Variable reordering and new configuration variables look good.** The repositioning of `ntasks_per_node`, `ctx_num`, and `concurrency` variables, along with the addition of new configuration variables (`isl`, `osl`, `multi_round`, `streaming`, etc.) enhances the script's configurability. However, ensure all placeholder values are properly set before use.

Also applies to: 7-16
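One way to make "properly set before use" enforceable is to give each knob a fallback via `${var:-default}` expansion, so an unedited script still parses and runs. The variable names follow the review; the default values below are illustrative only, not the ones in the actual `submit.sh`:

```shell
#!/usr/bin/env bash
# Illustrative defaults for the new benchmarking knobs. The values are
# placeholders for demonstration, not taken from the real submit.sh.
isl="${isl:-1024}"              # input sequence length
osl="${osl:-1024}"              # output sequence length
multi_round="${multi_round:-1}" # benchmark rounds
streaming="${streaming:-true}"  # streaming mode on/off
concurrency="${concurrency:-8}" # client concurrency

echo "isl=${isl} osl=${osl} multi_round=${multi_round}" \
     "streaming=${streaming} concurrency=${concurrency}"
```

Callers can still override any value from the environment (`isl=4096 ./submit.sh`) without editing the script.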
`29-41`: **LGTM! Well-organized argument structure.** The reorganized argument structure with clear groupings (Context servers, Generation servers, Other arguments, Benchmarking arguments, User specific arguments) makes the script more maintainable and readable.
examples/wide_ep/slurm_scripts/submit.sh (3)
`8-15`: **LGTM! New configuration variables enhance flexibility.** The addition of configuration variables (`isl`, `osl`, `multi_round`, `streaming`, etc.) provides better parameterization for the SLURM jobs.
`18-46`: **Well-structured nested loop for batch size and slot configurations.** The nested loops provide comprehensive testing across different batch sizes (1, 64, 1024) and `eplb_num_slots` values (0, 256, 288). The dynamic calculation of `concurrency`, `ctx_num`, and `total_node_num` based on batch size is appropriate.
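The sweep structure described above can be sketched as a pair of nested loops with derived values computed per iteration. The formulas below (concurrency as a multiple of batch size, ceil-division node sizing) are illustrative assumptions, not the actual arithmetic in `submit.sh`:

```shell
#!/usr/bin/env bash
# Sketch of the nested parameter sweep described in the review. The
# derived-value formulas are assumptions for illustration only.
for b in 1 64 1024; do
    for eplb_num_slots in 0 256 288; do
        concurrency=$((b * 8))                    # assumed scaling factor
        ctx_num=$(((concurrency + 5499) / 5500))  # assumed ceil-division sizing
        total_node_num=$((ctx_num + 4))           # assumed: ctx nodes + fixed gen nodes
        echo "b=${b} slots=${eplb_num_slots} concurrency=${concurrency}" \
             "ctx_num=${ctx_num} nodes=${total_node_num}"
    done
done
```

Deriving these per-iteration values inside the loop keeps a single submission path for all nine configurations instead of nine hand-maintained job blocks.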
`49-75`: **LGTM! Consistent pattern for dep32 configuration.** The second loop for batch size 512 with dep32 configuration follows the same well-structured pattern as the first loop, maintaining consistency in the script design.
Actionable comments posted: 2
♻️ Duplicate comments (3)
examples/disaggregated/slurm/submit.sh (1)
`42-46`: **Repeated: validate required SLURM variables before `sbatch`.** Same concern as in the previous review – if `${partition}`, `${account}`, or `${job_name}` are still empty, the job submission will fail at runtime. Add explicit checks (`[[ -z … ]]`) or abort early.

examples/wide_ep/slurm_scripts/submit.sh (2)
`5-11`: **Shell syntax error – placeholders must be quoted or replaced.** Same parsing issue as the other script; quote or default-initialize these variables.
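The ShellCheck SC1072/SC1073 errors reported below typically come from unquoted angle-bracket placeholders in assignments, which the shell parses as redirections. A short sketch of the two fixes suggested (the variable names follow the review; the placeholder text is illustrative):

```shell
# An assignment like:   partition=<partition>
# is a parse error, because <...> is treated as a redirection.
# Either quote the placeholder or default-initialize the variable.
partition="${partition:-}"   # default-initialize; stays empty until the user sets it
account="<account>"          # quoted placeholder parses fine as a literal string
echo "partition='${partition}' account='${account}'"
```

Quoted placeholders keep the template greppable; default-initialization additionally lets later `[[ -z … ]]` validation catch unfilled values.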
`80-88`: **Strict mode & validation still missing.** Add `set -euo pipefail` at the top and pre-flight checks for `${partition}`, `${account}`, `${job_name}` – see earlier suggestion.
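The suggested pre-flight checks can be factored into one helper that names every missing variable at once rather than failing on the first. A sketch under the review's variable names (the helper name and example values are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the suggested strict mode plus pre-flight validation.
set -euo pipefail

# Fail early, listing every required variable that is empty or unset.
require_vars() {
    local missing="" var
    for var in "$@"; do
        if [[ -z "${!var:-}" ]]; then   # ${!var}: value of the variable named by var
            missing="${missing} ${var}"
        fi
    done
    if [[ -n "${missing}" ]]; then
        echo "Error: unset required variable(s):${missing}" >&2
        return 1
    fi
}

# Example values; in the real script these come from the user's edits.
partition="gpu" account="proj" job_name="disagg"
require_vars partition account job_name
echo "all required variables set"
```

Running the checks before the `sbatch` call turns a confusing runtime submission failure into an immediate, named error.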
🧹 Nitpick comments (3)
examples/disaggregated/slurm/submit.sh (2)
`21-34`: **Quote expansions inside the `args` array to preserve spaces/colons.** `$mounts`, `$workdir`, `$model_dir` can all contain path separators or spaces. Unquoted expansions inside an array will word-split. Safer:

```diff
-    $container_image
-    $mounts
-    $workdir
-    $model_dir
+    "${container_image}"
+    "${mounts}"
+    "${workdir}"
+    "${model_dir}"
```
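The word-splitting hazard is easy to demonstrate in isolation; the mount string below is an illustrative example value:

```shell
#!/usr/bin/env bash
# Demonstration of the word-splitting risk described above.
mounts="/data/models:/models ro"   # example value containing a space

unquoted=($mounts)     # splits on whitespace -> two array elements
quoted=("${mounts}")   # stays one element, value preserved verbatim

echo "unquoted: ${#unquoted[@]} elements"
echo "quoted:   ${#quoted[@]} elements"
```

With the unquoted form, a command like `srun "${args[@]}"` would receive `ro` as a separate, meaningless argument.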
`1-2`: **Enable strict mode for safer scripting.** Consider `set -euo pipefail` right after the shebang to fail fast on unset variables, non-zero exits, and pipeline errors.

examples/wide_ep/slurm_scripts/submit.sh (1)
`29-42`: **Quote variable expansions in `args` array.** Same word-splitting risk as in the other script – wrap `${…}` in double quotes.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/disaggregated/slurm/submit.sh (1 hunks)
- examples/wide_ep/slurm_scripts/submit.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
examples/wide_ep/slurm_scripts/submit.sh
[error] 5-5: Couldn't parse this variable assignment. Fix to allow more checks.
(SC1073)
[error] 5-5: Fix any mentioned problems and try again.
(SC1072)
examples/disaggregated/slurm/submit.sh
[error] 3-3: Couldn't parse this variable assignment. Fix to allow more checks.
(SC1073)
[error] 3-3: Fix any mentioned problems and try again.
(SC1072)
/bot skip --comment "the scripts are currently not protected by CI pipeline"
PR_Github #13759 [ skip ] triggered by Bot
PR_Github #13759 [ skip ] completed with state
Signed-off-by: Kaiyu Xie <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
`--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

`--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.

`--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

`--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

`--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

`--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

`--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

`--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

`--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.