[None] [doc] Add more documents for large scale EP #7029
Conversation
Signed-off-by: Kaiyu Xie <[email protected]>
📝 Walkthrough

Adds Wide Expert Parallelism (WIDEEP) configuration examples and a YAML snippet to the docs, restructures the wide-EP Load Balancer doc into Online and Offline sections, expands troubleshooting (GB200 NUMA binding and EPLB shared-memory cleanup), updates references and usage links, and adds slurm_scripts to the examples list.
Sequence Diagram(s): omitted — documentation-only changes; no control-flow diagrams applicable.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
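For context, here is a minimal sketch of the kind of wide-EP YAML snippet the walkthrough refers to. The field values are taken from the examples quoted later in this review; the exact nesting is an assumption, so treat it as illustrative rather than the documented configuration:

```yaml
# Illustrative wide-EP configuration sketch (values from this PR's examples; nesting assumed)
moe_config:
  backend: WIDEEP              # enables Wide Expert Parallelism
  max_num_tokens: 9216         # cap on tokens sent to the fused MoE op at once
  load_balancer:               # optional online EP load balancer
    num_slots: 288             # total expert slots (must be >= total experts, per the config table)
    layer_updates_per_iter: 1  # number of layers updated per iteration (per the config table)
```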
Actionable comments posted: 2
🧹 Nitpick comments (3)
examples/wide_ep/README.md (3)
85-87: Specify a language for the fenced code block (markdownlint MD040)

Add a language to satisfy markdownlint and improve rendering.

```diff
-```
+```text
 FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
```

88-89: Make cleanup instruction safer and more actionable

Avoid suggesting a blanket wildcard delete without visibility; provide a minimal, interactive workflow and container note.

```diff
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+Manually inspect the `/dev/shm` directory and remove only the stale MoE shared-memory files if present.
+
+```bash
+# List potential MoE shared memory files
+ls -l /dev/shm | grep 'moe_shared_' || echo "No MoE shared memory files found"
+
+# Remove interactively to avoid accidental deletes
+rm -i -- /dev/shm/moe_shared_*
+```
+
+Note: If running inside a container, perform the cleanup in the container's `/dev/shm` (not the host).
```
96-98: Normalize References formatting: use a subheading instead of a parent list item

Removes an unnecessary parent list item and aligns structure with common Markdown style.

```diff
-- Technical Blog: Scaling Expert Parallelism in TensorRT-LLM
-  - [Part 1: Design and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
-  - [Part 2: Performance Status and Optimization](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
+#### Technical Blog: Scaling Expert Parallelism in TensorRT-LLM
+- [Part 1: Design and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
+- [Part 2: Performance Status and Optimization](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
```
📒 Files selected for processing (1)
- examples/wide_ep/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/wide_ep/README.md
[grammar] ~82-~82: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...
(QB_NEW_EN)
[grammar] ~82-~82: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~82-~82: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...
(QB_NEW_EN)
[grammar] ~96-~96: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...
(QB_NEW_EN)
[grammar] ~97-~97: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md
85-85: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
Signed-off-by: Kaiyu Xie <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (2)
examples/wide_ep/README.md (2)
80-86: Clarify NUMA guidance; bind both CPU and memory nodes

Reword for grammar/clarity and bind both CPU and memory nodes to avoid allocations from GPU NUMA nodes. Add a tip to verify node IDs.

```diff
-### GB200 NUMA binding
-
-GPU memory are also on NUMA nodes on GB200 and system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
-```bash
-numactl -m 0,1 <command>
-```
+### GB200 NUMA binding
+
+GPU memory is also exposed as NUMA nodes on GB200, and the OS may allocate from it. Bind CPU and memory to CPU NUMA nodes to prevent GPU memory from being used as host memory.
+```bash
+numactl --cpunodebind=0,1 --membind=0,1 <command>
+```
+Tip: Use `numactl -H` to list NUMA nodes and verify that `0,1` are CPU nodes on your system.
```
87-96: Tighten grammar, heading capitalization, and add code-fence language

Improve readability, fix capitalization, and annotate the fenced block to satisfy markdownlint (MD040). Also clarify the error path context.

```diff
-### Shared Memory Clean Up on EPLB
-
-To achieve online load balance, all expert weights are stored in shared host memory. 4 ranks on same GB200 node share the same expert weights to save memory. Normally, these shared host memory will be cleaned up at process exit, but they may not get chance to be cleaned if an abnormal exit happens.
-
-In that case, when seeing the following (or similar) error message:
-```
-FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
-```
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+### Shared Memory Cleanup on EPLB
+
+To enable online load balancing, expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory. Normally, this shared memory is cleaned up on process exit, but it may not be removed after an abnormal exit.
+
+If that happens and you see an error like:
+```text
+FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
+```
+manually check the `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if present. (The example path may appear without the `/dev/shm` prefix in logs.)
```
🧹 Nitpick comments (7)
examples/wide_ep/README.md (4)
64-66: Fix heading: use “Troubleshooting”

Use the standard single word for consistency with the rest of the docs.

```diff
-## Trouble shooting
+## Troubleshooting
```
24-37: Clarify `load_balancer` forms (inline vs external YAML) and keep examples consistent

The first example shows `load_balancer` as a file path, while later sections show an inline mapping. Please clarify both supported forms and when to use each, or standardize on one form for this doc.

An example yaml file to enable wide EP:

````diff
 ```yaml
 moe_config:
     backend: WIDEEP
     max_num_tokens: 9216
-    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
+    # Load balancer can be specified in two ways:
+    # 1) As a path to an external YAML file:
+    #    load_balancer: moe_load_balancer.yaml
+    # 2) Inline as a mapping (see the Online Load Balancer Configuration below).
+    load_balancer: # (optional) enable load balancer
 ```
````

| Parameter | Description | Default | Notes |
|-----------|-------------|---------|-------|
| `backend` | MoE backend type | `CUTLASS` | Set to `WIDEEP` to enable wide EP |
| `max_num_tokens` | If set, at most max_num_tokens tokens will be sent to torch.ops.trtllm.fused_moe at the same time. | `None` | If the number of tokens exceeds max_num_tokens, the input tensors will be split into chunks and a for loop will be used. |
| `load_balancer` | Configuration for MoE load balancing | `None` | |

49-53: Clarify `num_slots` scope

“Must be ≥ total experts” could be interpreted per-layer or global. Specify whether it’s the total across the entire model or per layer to prevent misconfiguration.

58-59: Polish the offline/online note

Tighten wording for clarity.

```diff
-*Online EP Load Balancer is more suitable for production deployment needs to react timely to the online traffic changes.*
+*The Online EP Load Balancer is generally more suitable for production deployments, as it reacts quickly to traffic changes.*
```
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (3)
215-226: Add hardware/applicability notes and avoid conflicts with earlier FP8 example

This section enables WIDEEP (Wide-EP), which per the support matrix applies to GB200 NVL72 with EP>8 and NVFP4. Earlier, the doc set `backend: DEEPGEMM` for FP8; clarify that WIDEEP is not for FP8 and should be used only in the supported scenario to avoid confusion.

````diff
 ### Wide Expert Parallelism

 Add the following fields to the YAML configuration file `/tmp/config.yml` to enable wide EP:

 ```yaml
 moe_config:
     backend: WIDEEP
     max_num_tokens: 9216
     load_balancer: # configure online EP balancer
         num_slots: 288
         layer_updates_per_iter: 1
 ```
+
+Note:
+- WIDEEP is currently supported for GB200 NVL72 with EP > 8 and NVFP4 (see the MoE Backend Support Matrix above). It is not available for FP8.
+- If you followed the earlier FP8 example that sets `backend: DEEPGEMM`, switch to `backend: WIDEEP` only when targeting the supported GB200/NVFP4, EP>8 configuration.
````

215-228: Explain max_num_tokens discrepancy vs earlier examples

Earlier, you used `max_num_tokens: 3200`. Here, it’s `9216`. Add a brief rationale or guidance so users know which value to pick.

```diff
-Add the following fields to the YAML configuration file `/tmp/config.yml` to enable wide EP:
+Add the following fields to the YAML configuration file `/tmp/config.yml` to enable wide EP:
+<!-- For Wide-EP on GB200/NVL72, a larger `max_num_tokens` (e.g., 9216) is typically viable due to higher capacity; for other setups, use the earlier recommended 3200 or profile accordingly. -->
```
227-228: Cross-link offline EP configuration doc

Provide a quick pointer here as well for completeness.

```diff
-Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) for more details.
+Refer to the wide EP [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) for more details, and see the [Offline EP Load Balancer](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer#offline-ep-load-balancer) documentation if you plan to precompute/avoid online updates.
```
📒 Files selected for processing (2)
- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
- examples/wide_ep/README.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
🪛 LanguageTool
examples/wide_ep/README.md
[grammar] ~49-~49: There might be a mistake here.
Context: ...ameter | Description | Default | Notes | |-----------|-------------|---------|---...
(QB_NEW_EN)
[grammar] ~50-~50: There might be a mistake here.
Context: ...-------|-------------|---------|-------| | num_slots
| Total number of expert s...
(QB_NEW_EN)
[grammar] ~51-~51: There might be a mistake here.
Context: ...ots | None
| Must be ≥ total experts | | layer_updates_per_iter
| Number of l...
(QB_NEW_EN)
[grammar] ~89-~89: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...
(QB_NEW_EN)
[grammar] ~89-~89: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~89-~89: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...
(QB_NEW_EN)
[grammar] ~103-~103: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/wide_ep/README.md (1)
62-63: All referenced paths in examples/wide_ep/README.md are valid

The following directories exist and match the links in the README:
examples/wide_ep/slurm_scripts/
examples/disaggregated/slurm/
examples/wide_ep/ep_load_balancer/
No changes needed.
LGTM, thanks.
Signed-off-by: Kaiyu Xie <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (2)
examples/wide_ep/README.md (2)
99-108: Grammar fixes and clearer heading in EPLB cleanup; add fenced code language (markdownlint MD040).

Improve readability and fix capitalization; specify the language for the error snippet.

```diff
-### Shared Memory Clean Up on EPLB
-
-To achieve online load balance, all expert weights are stored in shared host memory. 4 ranks on same GB200 node share the same expert weights to save memory. Normally, these shared host memory will be cleaned up at process exit, but they may not get chance to be cleaned if an abnormal exit happens.
-
-In that case, when seeing the following (or similar) error message:
-```
-FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
-```
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+### Shared Memory Cleanup on EPLB
+
+To enable online load balancing, all expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory. Normally, this shared host memory is cleaned up on process exit, but it may not be removed after an abnormal exit.
+
+If that happens and you see an error like:
+```text
+FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
+```
+manually check the `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if present.
```
92-98: Bind both CPU and memory NUMA nodes; reword for clarity (and avoid accidental allocation from GPU NUMA nodes).

Binding only memory (`-m`) can still allow CPU allocations on GPU NUMA nodes on GB200. Bind CPU and memory nodes together and add a short verification tip.

```diff
-### GB200 NUMA binding
-
-GPU memory are also on NUMA nodes on GB200 and system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
-```bash
-numactl -m 0,1 <command>
-```
+### GB200 NUMA binding
+
+GPU memory is also exposed as NUMA nodes on GB200, and the OS may allocate from it. Bind allocations to CPU NUMA nodes to prevent GPU memory from being used as host memory.
+```bash
+numactl --cpunodebind=0,1 --membind=0,1 <command>
+```
+Tip: Use `numactl -H` to list NUMA nodes and verify that `0,1` are CPU nodes on your system.
```
🧹 Nitpick comments (4)
examples/wide_ep/README.md (4)
22-29: Fix list style and minor grammar in Prerequisites (markdownlint MD004, phrasing).

Use dashes for unordered lists and tighten wording/capitalization.

```diff
-### Prerequisites
-
-* GPU: GB200 NVL72, H20, or RTX PRO 6000 Blackwell Workstation Edition.
-* OS: Linux
-* Drivers: CUDA Driver 575 or Later
-* Docker with NVIDIA Container Toolkit installed
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+### Prerequisites
+
+- GPU: GB200 NVL72, H20, or RTX PRO 6000 Blackwell Workstation Edition.
+- OS: Linux
+- Drivers: CUDA driver 575 or later
+- Docker with NVIDIA Container Toolkit installed
+- Python 3 and pip3 (optional; for accuracy evaluation only)
```
30-33: Use “set up” (verb), remove bare URL (markdownlint MD034), and clarify wording.

This avoids lint warnings and reads more clearly.

```diff
-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
-
-For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
+For GB200 NVL72, to ensure that Multi-Node NVLink (MNNVL) is correctly set up, check that the path `/dev/nvidia-caps-imex-channels` exists in the container. If it is missing, mount it when launching the Docker container.
+
+For more information on the NVIDIA IMEX service for NVLink networks, see the NVIDIA Multi-Node NVLink Systems IMEX Guide: <https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html>.
```
50-59: Clarify whether `load_balancer` accepts a file path, inline mapping, or both.

The earlier example uses a file path (moe_load_balancer.yaml) while this example shows an inline mapping. If both are supported, state it explicitly to avoid confusion; if only one is supported, make the examples consistent.
Would you like me to draft a short “Note:” under the examples that says: “load_balancer can be a path to a YAML file or an inline mapping” with a minimal example of each?
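For illustration, a hedged sketch of what such a note's example could look like, assuming both forms are in fact accepted (the file name and inline fields are taken from this PR's examples; neither form is confirmed here as the authoritative API):

```yaml
# Option A (assumed): reference an external load-balancer YAML file
moe_config:
  backend: WIDEEP
  max_num_tokens: 9216
  load_balancer: moe_load_balancer.yaml
---
# Option B: inline mapping, as in the Online Load Balancer example
moe_config:
  backend: WIDEEP
  max_num_tokens: 9216
  load_balancer:
    num_slots: 288
    layer_updates_per_iter: 1
```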
66-71: Tighten grammar in the production-suitability note.

The current sentence is awkward.

```diff
-*Online EP Load Balancer is more suitable for production deployment needs to react timely to the online traffic changes.*
+*The Online EP Load Balancer is better suited for production deployments because it can react quickly to traffic changes.*
```
📒 Files selected for processing (1)
- examples/wide_ep/README.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/wide_ep/README.md
[grammar] ~25-~25: There might be a mistake here.
Context: ...ackwell Workstation Edition. * OS: Linux * Drivers: CUDA Driver 575 or Later * Dock...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...inux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...
(QB_NEW_EN)
[grammar] ~27-~27: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...
(QB_NEW_EN)
[grammar] ~61-~61: There might be a mistake here.
Context: ...ameter | Description | Default | Notes | |-----------|-------------|---------|---...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...-------|-------------|---------|-------| | num_slots
| Total number of expert s...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ...ots | None
| Must be ≥ total experts | | layer_updates_per_iter
| Number of l...
(QB_NEW_EN)
[grammar] ~101-~101: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...
(QB_NEW_EN)
[grammar] ~101-~101: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~101-~101: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...
(QB_NEW_EN)
[grammar] ~115-~115: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md
24-24: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
25-25: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
26-26: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
27-27: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
28-28: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
32-32: Bare URL used
(MD034, no-bare-urls)
104-104: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Kaiyu Xie <[email protected]>
/bot skip --comment "doc changes"
Actionable comments posted: 0
♻️ Duplicate comments (2)
examples/wide_ep/README.md (2)
92-98: Clarify NUMA guidance, bind both CPU and memory nodes, and fix grammar.

Bind CPU and memory nodes together to avoid accidental allocation from GPU NUMA nodes. Add a tip to verify node IDs.

```diff
-### GB200 NUMA binding
-
-GPU memory are also on NUMA nodes on GB200 and system can also use that. Bind memory to CPU nodes to avoid GPU memory being used as host memory.
-```bash
-numactl -m 0,1 <command>
-```
+### GB200 NUMA binding
+
+GPU memory is also exposed as NUMA nodes on GB200, and the OS may allocate from it. Bind to CPU NUMA nodes for both CPU and memory to prevent GPU memory from being used as host memory.
+```bash
+numactl --cpunodebind=0,1 --membind=0,1 <command>
+```
+Tip: Run `numactl -H` to list NUMA nodes and confirm that `0,1` are CPU nodes on your system.
```
99-108: Grammar fixes and clearer heading in EPLB cleanup section; add code fence language (MD040).

Tighten language, correct capitalization, and specify code block language.

```diff
-### Shared Memory Clean Up on EPLB
-
-To achieve online load balance, all expert weights are stored in shared host memory. 4 ranks on same GB200 node share the same expert weights to save memory. Normally, these shared host memory will be cleaned up at process exit, but they may not get chance to be cleaned if an abnormal exit happens.
-
-In that case, when seeing the following (or similar) error message:
-```
-FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
-```
-you need to manually check `/dev/shm` directory and delete `/dev/shm/moe_shared_*` if any.
+### Shared Memory Cleanup on EPLB
+
+To enable online load balancing, expert weights are stored in shared host memory. Four ranks on the same GB200 node share the same expert weights to save memory. Normally, this shared memory is cleaned up on process exit, but it may not be removed after an abnormal exit.
+
+If that happens and you see an error like:
+```text
+FileExistsError: [Errno 17] File exists: '/moe_shared_l0_lr0_all'
+```
+manually check the `/dev/shm` directory and delete any `/dev/shm/moe_shared_*` entries.
```
🧹 Nitpick comments (7)
examples/wide_ep/README.md (7)
24-29: Tighten prerequisites wording and align list style with markdownlint (MD004).

Use dashes, fix capitalization, and avoid trailing periods for consistency.

```diff
-* GPU: GB200 NVL72, H20, or RTX 6000D.
-* OS: Linux
-* Drivers: CUDA Driver 575 or Later
-* Docker with NVIDIA Container Toolkit installed
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+- GPU: GB200 NVL72, H20, or RTX 6000D
+- OS: Linux
+- Driver: NVIDIA CUDA driver 575 or later
+- Docker with NVIDIA Container Toolkit installed
+- Python 3 and pip (optional; for accuracy evaluation only)
```
30-33: Fix “set up”, add a concrete mount example, and replace bare URL with a titled link (MD034).

Small grammar improvement and actionable example reduce ambiguity; link text improves readability.

```diff
-For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup, check if the path `/dev/nvidia-caps-imex-channels` exists in the container. If the path doesn't exist, mount it when launching the Docker container.
-
-For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
+For GB200 NVL72, to ensure Multi-Node NVLink (MNNVL) is correctly set up, check whether `/dev/nvidia-caps-imex-channels` exists inside the container. If it does not exist, bind-mount it when launching the Docker container. For example:
+```bash
+docker run --gpus all --rm -it \
+  --net=host --ipc=host \
+  -v /dev/nvidia-caps-imex-channels:/dev/nvidia-caps-imex-channels:ro \
+  <image> ...
+```
+
+For more information on the NVIDIA IMEX service for NVLink networks, see the [IMEX Guide](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html).
```
36-42: Clarify whether `load_balancer` accepts a file path and/or inline object; avoid mixed examples without context.

The section below shows inline fields, while this snippet uses a file path. Confirm both forms are supported, and make the comment explicit.

```diff
-    load_balancer: moe_load_balancer.yaml # (optional) enable load balancer
+    load_balancer: moe_load_balancer.yaml # optional: path to a YAML config (see below for inline option)
```

If only one form is supported, align both examples accordingly. If both are supported, consider adding brief “Option A (file)” / “Option B (inline)” subheadings.
66-71: Reword the offline/online guidance for clarity.

Streamline the sentence and fix grammar.

```diff
-*Online EP Load Balancer is more suitable for production deployment needs to react timely to the online traffic changes.*
+*The Online EP Load Balancer is more suitable for production deployments because it reacts promptly to traffic changes.*
```
76-76: Fix heading typo: “Trouble shooting” → “Troubleshooting”.

Improves professionalism in user-facing docs.

```diff
-## Trouble shooting
+## Troubleshooting
```
80-90: Transparent Huge Pages: tighten wording and add a sudo-safe command.

- Use “Transparent Huge Pages (THP)” and correct subject-verb agreement.
- Remove shell prompt markers in code blocks.
- Use tee with sudo to avoid redirection permission issues.

```diff
-When getting exception `madvise(MADV_HUGEPAGE) failed.`, check if Transparent Hugepages has been enabled.
+If you see `madvise(MADV_HUGEPAGE) failed.`, verify that Transparent Huge Pages (THP) are enabled.
->$ cat /sys/kernel/mm/transparent_hugepage/enabled
+cat /sys/kernel/mm/transparent_hugepage/enabled
 always [madvise] never
->$ cat /sys/kernel/mm/transparent_hugepage/defrag
+cat /sys/kernel/mm/transparent_hugepage/defrag
 always defer defer+madvise [madvise] never
-If `never` is highlighted, enable Transparent HugePages by the following command.
+If `never` is highlighted, enable Transparent Huge Pages with:
-echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
+echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >/dev/null
```

61-65: Minor table clarity: specify “number of experts”.

Small wording tweak improves precision.

```diff
-| `num_slots` | Total number of expert slots | `None` | Must be ≥ total experts |
+| `num_slots` | Total number of expert slots | `None` | Must be ≥ total number of experts |
```
📒 Files selected for processing (1)
- examples/wide_ep/README.md (3 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/wide_ep/README.md
[grammar] ~25-~25: There might be a mistake here.
Context: ...00 NVL72, H20, or RTX 6000D. * OS: Linux * Drivers: CUDA Driver 575 or Later * Dock...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...inux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...
(QB_NEW_EN)
[grammar] ~27-~27: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...
(QB_NEW_EN)
[grammar] ~61-~61: There might be a mistake here.
Context: ...ameter | Description | Default | Notes | |-----------|-------------|---------|---...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...-------|-------------|---------|-------| | num_slots
| Total number of expert s...
(QB_NEW_EN)
[grammar] ~63-~63: There might be a mistake here.
Context: ...ots | None
| Must be ≥ total experts | | layer_updates_per_iter
| Number of l...
(QB_NEW_EN)
[grammar] ~101-~101: There might be a mistake here.
Context: ...e stored in shared host memory. 4 ranks on same GB200 node share the same expert w...
(QB_NEW_EN)
[grammar] ~101-~101: Ensure spelling is correct
Context: ...xpert weights to save memory. Normally, these shared host memory will be cleaned up a...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~101-~101: There might be a mistake here.
Context: ...ed up at process exit, but they may not get chance to be cleaned if an abnormal exi...
(QB_NEW_EN)
[grammar] ~115-~115: There might be a mistake here.
Context: ...aling Expert Parallelism in TensorRT-LLM - [Part 1: Design and Implementation of Lar...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...ign and Implementation of Large-scale EP](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) - [Part 2: Performance Status and Optimizat...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
examples/wide_ep/README.md
24-24: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
25-25: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
26-26: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
27-27: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
28-28: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
32-32: Bare URL used
(MD034, no-bare-urls)
104-104: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
examples/wide_ep/README.md (3)
50-59: Online LB YAML example reads well.

Inline config is clear and consistent with the table below.
72-75: SLURM guidance looks good.

Linking to slurm_scripts and disaggregated scripts is helpful for users.
115-117: References section looks good.

Links and labels are clear and helpful.
PR_Github #15768 [ skip ] triggered by Bot

PR_Github #15768 [ skip ] completed with state
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
`--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

`--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.

`--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

`--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

`--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

`--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

`--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

`--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

`--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
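As an illustration, a couple of `/bot run` invocations composed from the options above (the stage and GPU names are the placeholder examples from this help text, not values validated for this PR):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"
/bot run --stage-list "A10-PyTorch-1" --debug
```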
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.