
issues on llama3 compile + (async) TP + AC #1185

@tianyu-l

Description


Bug description

On 8 GPUs, with DP2 TP4:

  1. compile + selective op AC + TP:
     fails with

     ```
     File "/data/users/lty/pytorch/torch/_ops.py", line 1317, in __getattr__
         raise AttributeError(
     AttributeError: '_OpNamespace' 'symm_mem' object has no attribute 'fused_all_gather_matmul'
     ```

Note: DP4 TP2 works.
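The `AttributeError` above indicates that `fused_all_gather_matmul` was never registered in the `symm_mem` op namespace for that build. A minimal probe (a sketch; only the op name comes from the traceback, everything else is generic `torch.ops` behavior) to check whether the fused op is available before enabling async TP:

```python
import torch


def has_fused_all_gather_matmul() -> bool:
    """Return True if torch.ops.symm_mem.fused_all_gather_matmul is registered.

    _OpNamespace.__getattr__ raises AttributeError for unregistered ops,
    which is exactly the failure mode in the traceback above.
    """
    try:
        torch.ops.symm_mem.fused_all_gather_matmul
        return True
    except AttributeError:
        return False


print(has_fused_all_gather_matmul())
```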

  2. compile + selective op AC + async TP:
     very low throughput compared with "full AC + TP"

  3. compile + full AC + async TP:
     emits the following warning and gets very low throughput (compared with "full AC + TP")

     ```
     torch/_inductor/fx_passes/micro_pipeline_tp.py:894] [0/1] no producer matmul found for reduce scatter, skipping fuse_matmul_reduce_scatter fusion
     ```
  4. compile + async TP (+ selective 2 AC) is still failing on CI machines: https://github.com/pytorch/torchtitan/actions/runs/14992456541/job/42118899398?pr=1186
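For reference, the failing combinations map onto a torchtitan-style config along these lines. This is a sketch: the section and key names below are assumptions (they have moved between versions), not taken from this issue.

```toml
# Hedged sketch of the failing combination (compile + selective op AC + TP,
# DP2 TP4 on 8 GPUs); key names are assumptions and may differ by version.
[training]
compile = true

[parallelism]
data_parallel_shard_degree = 2
tensor_parallel_degree = 4
enable_async_tensor_parallel = false  # set true for the async-TP cases

[activation_checkpoint]
mode = "selective"          # "full" for the full-AC case
selective_ac_option = "op"  # per-op selective AC
```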

Versions

- latest PyTorch built from source
- A100 GPUs
- debug model
