Conversation

@Autumn1998 (Contributor) commented Sep 8, 2025

Description

Add support for new data types (bf16 and fp32) for the tokens-per-expert input of the MoE aux-loss computation kernel.
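For illustration, here is a minimal CUDA sketch of what this kind of dtype extension typically looks like: the tokens-per-expert input is templated over its storage type, and the host-side launcher gains one dispatch case per supported dtype. All names and signatures below are hypothetical, not TransformerEngine's actual kernel API; the loss computed is assumed to be the standard load-balancing auxiliary loss, coeff · E · Σ_e (frac_e · mean_prob_e).

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Read a per-expert token count as fp32, whatever its storage dtype.
__device__ inline float to_float(float v) { return v; }
__device__ inline float to_float(__nv_bfloat16 v) { return __bfloat162float(v); }

// Hypothetical aux-loss kernel: one block, each thread covers a strided
// subset of experts, then a shared-memory tree reduction.
template <typename T>
__global__ void moe_aux_loss_kernel(const T *tokens_per_expert,  // [E]
                                    const float *mean_probs,     // [E] mean router prob per expert
                                    int num_experts, float inv_total_tokens,
                                    float coeff, float *loss_out) {
  __shared__ float partial[256];
  float acc = 0.f;
  for (int e = threadIdx.x; e < num_experts; e += blockDim.x) {
    float frac = to_float(tokens_per_expert[e]) * inv_total_tokens;
    acc += frac * mean_probs[e];
  }
  partial[threadIdx.x] = acc;
  __syncthreads();
  // Tree reduction over the block (blockDim.x assumed to be a power of two).
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) *loss_out = coeff * num_experts * partial[0];
}

// Host-side dispatch: supporting bf16/fp32 amounts to adding cases that
// instantiate the template for the new dtypes.
enum class Dtype { kFloat32, kBFloat16 };

void launch_moe_aux_loss(const void *tokens_per_expert, Dtype dtype,
                         const float *mean_probs, int num_experts,
                         float inv_total_tokens, float coeff, float *loss_out,
                         cudaStream_t stream) {
  switch (dtype) {
    case Dtype::kFloat32:
      moe_aux_loss_kernel<float><<<1, 256, 0, stream>>>(
          static_cast<const float *>(tokens_per_expert), mean_probs,
          num_experts, inv_total_tokens, coeff, loss_out);
      break;
    case Dtype::kBFloat16:
      moe_aux_loss_kernel<__nv_bfloat16><<<1, 256, 0, stream>>>(
          static_cast<const __nv_bfloat16 *>(tokens_per_expert), mean_probs,
          num_experts, inv_total_tokens, coeff, loss_out);
      break;
  }
}
```

Upcasting the bf16 counts to fp32 before the reduction keeps the accumulation in full precision, so an added dtype only changes the load path, not the numerics of the loss.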

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)

Changes

Changes introduced in this PR:

  • Add bf16 and fp32 support for the tokens-per-expert input of the MoE aux-loss computation kernel

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Autumn1998 force-pushed the tongliu/router_fusion branch from 5708f20 to de887b4 on September 8, 2025 at 07:08
@yaox12 (Member) commented Sep 8, 2025

/te-ci

@yaox12 changed the title from "add bf16/fp32 token-per-expert on the moe-loss-computation on router …" to "Add bf16/fp32 token-per-expert to the MoE aux loss kernel" on Sep 8, 2025
@yaox12 yaox12 merged commit a26a7f1 into NVIDIA:main Sep 9, 2025
41 checks passed
vthumbe1503 added a commit to vthumbe1503/TransformerEngine that referenced this pull request Sep 19, 2025
Signed-off-by: vthumbe1503 <[email protected]>

Add bf16/fp32 token-per-expert to the MoE aux loss kernel (NVIDIA#2162)

* add bf16/fp32 token-per-expert support to the MoE loss computation in router fusion

Signed-off-by: tongliu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tongliu <[email protected]>
Co-authored-by: tongliu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[JAX] Scale swizzling via JAX transpose op (NVIDIA#2163)

* add swizzle in jax

Signed-off-by: Phuong Nguyen <[email protected]>

* added outer_impl

Signed-off-by: Phuong Nguyen <[email protected]>

* clean up FFI

Signed-off-by: Phuong Nguyen <[email protected]>

---------

Signed-off-by: Phuong Nguyen <[email protected]>

Extract cpp distributed tests into a separate project (NVIDIA#2165)

* Extract cpp distributed tests into a separate project

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Remove obsolete exclusion

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Run L1_cpp_distributed tests if at least 4 GPUs

Signed-off-by: Vladimir Cherepanov <[email protected]>

---------

Signed-off-by: Vladimir Cherepanov <[email protected]>

Adds context parallelism utilities: moving CP shards to different ranks and padding sequences to a divisibility factor (NVIDIA#2129)

* test: adds unit tests for the CP utilities, along with the utilities themselves

Signed-off-by: Jonathan Mitchell <[email protected]>

* assert line change

Signed-off-by: Jonathan Mitchell <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jonathan Mitchell <[email protected]>
Co-authored-by: Jonathan Mitchell <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sudhakar Singh <[email protected]>