[roadmap/tracker] Low precision MoE training #2147

@danielvegamyhre

Description

Creating this issue as a roadmap/tracker for enabling float8 training for MoEs with token-choice routing. It covers both core requirements and ideas for additional performance optimizations.
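
For context, here is a minimal sketch of token-choice routing feeding a grouped GEMM, the pattern this roadmap targets. The shapes and the router are illustrative assumptions (not torchao's implementation), and `torch._grouped_mm` is a private PyTorch op whose signature may change:

```python
import torch

# Illustrative shapes; not taken from any real model config.
num_tokens, dim, hidden = 64, 128, 256
num_experts, top_k = 8, 2

x = torch.randn(num_tokens, dim, device="cuda", dtype=torch.bfloat16)
router = torch.nn.Linear(dim, num_experts, device="cuda", dtype=torch.bfloat16)
w = torch.randn(num_experts, dim, hidden, device="cuda", dtype=torch.bfloat16)

# Token-choice routing: each token independently picks its top-k experts.
topk_scores, topk_idx = router(x).topk(top_k, dim=-1)

# Sort the (token, expert) pairs by expert so each expert's tokens are
# contiguous -- the layout a grouped GEMM consumes.
sorted_experts, perm = topk_idx.flatten().sort()
x_grouped = x.repeat_interleave(top_k, dim=0)[perm]

# Cumulative per-expert token counts as int32 offsets. Low precision
# variants typically also pad each group (e.g. to a multiple of 16)
# to satisfy kernel alignment; that bookkeeping is omitted here.
offs = torch.bincount(sorted_experts, minlength=num_experts).cumsum(0).to(torch.int32)

# One grouped GEMM across all experts, replacing a Python loop of matmuls.
out = torch._grouped_mm(x_grouped, w, offs=offs)
```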

UPDATE 07/22/2025: revised priorities to reflect a shift in focus toward mxfp8.

This is not an exhaustive list, but it highlights the primary milestones and requirements.

Compute

Communication

I looked at traces and validated that "all-to-all dispatch and shuffle -> grouped GEMM -> all-to-all combine and unshuffle" is a sequentially dependent chain, so in theory faster, low precision comms should improve performance. There is some overlap with the shared expert computation, but it is not 100% overlap, so there is room for optimization. This will be especially important if/when the all-to-alls span multiple nodes, where inter-node network bandwidth is lower than intra-node NVLink bandwidth.
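
To make that dependency chain concrete, below is a rough sketch of the pipeline using `torch.distributed.all_to_all_single`. The names `moe_dispatch_combine`, `in_splits`, `out_splits`, and `grouped_experts` are hypothetical; the split sizes would come from the router, and the comments mark where quantizing to float8/mxfp8 would cut bytes on the wire:

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(x_sorted, in_splits, out_splits, grouped_experts):
    """Sequential EP pipeline: dispatch -> grouped GEMM -> combine.

    x_sorted:   (sum(in_splits), dim) tokens sorted by destination rank
    in_splits:  tokens this rank sends to each peer (from the router)
    out_splits: tokens this rank receives from each peer
    """
    # All-to-all dispatch + shuffle. Quantizing x_sorted to float8/mxfp8
    # here (and dequantizing on receipt) is the low precision comms idea.
    recv = x_sorted.new_empty(sum(out_splits), x_sorted.shape[-1])
    dist.all_to_all_single(recv, x_sorted, out_splits, in_splits)

    # Expert computation: one grouped GEMM over the local experts' tokens.
    # Nothing downstream can start until the comms above complete.
    y = grouped_experts(recv)

    # All-to-all combine + unshuffle: split sizes are reversed.
    out = y.new_empty(sum(in_splits), y.shape[-1])
    dist.all_to_all_single(out, y, in_splits, out_splits)
    return out
```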

Torchao UX

Compile support

  • Compile support for torch._grouped_mm
  • Differentiable _scaled_grouped_mm can compile with fullgraph=True (see the sketch after this list)
  • E2E compilation of each TransformerBlock in torchtitan after MoE conversion via the tensor subclass approach (fullgraph=False)
  • E2E compilation of each TransformerBlock in torchtitan after MoE conversion via the tensor subclass approach (fullgraph=True)
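
As a rough illustration of the fullgraph=True milestone, the sketch below compiles a differentiable grouped GEMM forward and backward with no graph breaks. It uses plain bf16 `torch._grouped_mm` as a stand-in for the scaled variant, needs a recent PyTorch build, and is not a statement that the milestone is done:

```python
import torch

# Stand-in for the differentiable scaled grouped GEMM; fullgraph=True
# asserts there are no graph breaks in forward or backward.
@torch.compile(fullgraph=True)
def moe_expert_ffn(x, w, offs):
    return torch._grouped_mm(x, w, offs=offs)

x = torch.randn(64, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
w = torch.randn(4, 128, 256, device="cuda", dtype=torch.bfloat16, requires_grad=True)
offs = torch.tensor([16, 32, 48, 64], device="cuda", dtype=torch.int32)

moe_expert_ffn(x, w, offs).sum().backward()  # exercises both compiled graphs
```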

Distributed support
