Conversation

@Autumn1998 (Contributor) commented Sep 8, 2025

Description

Add support for new data types (bf16 and fp32) for the tokens-per-expert input of the MoE aux-loss computation kernel.
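For illustration, here is a minimal CUDA sketch of what this kind of dtype extension typically looks like: the tokens-per-expert input is templated over its storage type, and the host-side launcher gains one dispatch case per supported dtype. All names and signatures below are hypothetical, not TransformerEngine's actual kernel API; the loss computed is assumed to be the standard load-balancing auxiliary loss, coeff · E · Σ_e (frac_e · mean_prob_e).

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Read a per-expert token count as fp32, whatever its storage dtype.
__device__ inline float to_float(float v) { return v; }
__device__ inline float to_float(__nv_bfloat16 v) { return __bfloat162float(v); }

// Hypothetical aux-loss kernel: one block, each thread covers a strided
// subset of experts, then a shared-memory tree reduction.
template <typename T>
__global__ void moe_aux_loss_kernel(const T *tokens_per_expert,  // [E]
                                    const float *mean_probs,     // [E] mean router prob per expert
                                    int num_experts, float inv_total_tokens,
                                    float coeff, float *loss_out) {
  __shared__ float partial[256];
  float acc = 0.f;
  for (int e = threadIdx.x; e < num_experts; e += blockDim.x) {
    float frac = to_float(tokens_per_expert[e]) * inv_total_tokens;
    acc += frac * mean_probs[e];
  }
  partial[threadIdx.x] = acc;
  __syncthreads();
  // Tree reduction over the block (blockDim.x assumed to be a power of two).
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) *loss_out = coeff * num_experts * partial[0];
}

// Host-side dispatch: supporting bf16/fp32 amounts to adding cases that
// instantiate the template for the new dtypes.
enum class Dtype { kFloat32, kBFloat16 };

void launch_moe_aux_loss(const void *tokens_per_expert, Dtype dtype,
                         const float *mean_probs, int num_experts,
                         float inv_total_tokens, float coeff, float *loss_out,
                         cudaStream_t stream) {
  switch (dtype) {
    case Dtype::kFloat32:
      moe_aux_loss_kernel<float><<<1, 256, 0, stream>>>(
          static_cast<const float *>(tokens_per_expert), mean_probs,
          num_experts, inv_total_tokens, coeff, loss_out);
      break;
    case Dtype::kBFloat16:
      moe_aux_loss_kernel<__nv_bfloat16><<<1, 256, 0, stream>>>(
          static_cast<const __nv_bfloat16 *>(tokens_per_expert), mean_probs,
          num_experts, inv_total_tokens, coeff, loss_out);
      break;
  }
}
```

Upcasting the bf16 counts to fp32 before the reduction keeps the accumulation in full precision, so an added dtype only changes the load path, not the numerics of the loss.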

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)

Changes

Changes introduced in this PR:

  • Add bf16 and fp32 support for the tokens-per-expert input of the MoE aux-loss computation kernel

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Autumn1998 force-pushed the tongliu/router_fusion branch from 5708f20 to de887b4 on September 8, 2025 at 07:08
@yaox12 (Member) commented Sep 8, 2025

/te-ci

@yaox12 changed the title from "add bf16/fp32 token-per-expert on the moe-loss-computation on router …" to "Add bf16/fp32 token-per-expert to the MoE aux loss kernel" on Sep 8, 2025
@yaox12 yaox12 merged commit a26a7f1 into NVIDIA:main Sep 9, 2025
41 checks passed
vthumbe1503 added a commit to vthumbe1503/TransformerEngine that referenced this pull request Sep 19, 2025
Signed-off-by: vthumbe1503 <[email protected]>

Add bf16/fp32 token-per-expert to the MoE aux loss kernel (NVIDIA#2162)

* add bf16/fp32 token-per-expert support to the MoE loss computation in router fusion

Signed-off-by: tongliu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tongliu <[email protected]>
Co-authored-by: tongliu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[JAX] Scale swizzling via JAX transpose op (NVIDIA#2163)

* add swizzle in jax

Signed-off-by: Phuong Nguyen <[email protected]>

* added outer_impl

Signed-off-by: Phuong Nguyen <[email protected]>

* clean up FFI

Signed-off-by: Phuong Nguyen <[email protected]>

---------

Signed-off-by: Phuong Nguyen <[email protected]>

Extract cpp distributed tests into a separate project (NVIDIA#2165)

* Extract cpp distributed tests into a separate project

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Remove obsolete exclusion

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Run L1_cpp_distributed tests if at least 4 GPUs

Signed-off-by: Vladimir Cherepanov <[email protected]>

---------

Signed-off-by: Vladimir Cherepanov <[email protected]>

Adds context parallelism utilities: moving CP shards to different ranks and padding sequences to a divisibility factor (NVIDIA#2129)

* test: adds unit tests for the CP utilities, along with the utilities themselves

Signed-off-by: Jonathan Mitchell <[email protected]>

* assert line change

Signed-off-by: Jonathan Mitchell <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jonathan Mitchell <[email protected]>
Co-authored-by: Jonathan Mitchell <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sudhakar Singh <[email protected]>