Bug description
On 8 GPUs, DP2 TP4:

- compile + selective op AC + TP: fails with

  ```
  File "/data/users/lty/pytorch/torch/_ops.py", line 1317, in __getattr__
      raise AttributeError(
  AttributeError: '_OpNamespace' 'symm_mem' object has no attribute 'fused_all_gather_matmul'
  ```

  Note: DP4 TP2 works.
- compile + selective op AC + async TP: very low throughput compared with "full AC + TP".
- compile + full AC + async TP: very low throughput (compared with "full AC + TP") and the following warning:

  ```
  torch/_inductor/fx_passes/micro_pipeline_tp.py:894] [0/1] no producer matmul found for reduce scatter, skipping fuse_matmul_reduce_scatter fusion
  ```
- compile + async TP (+ selective 2 AC) still fails on CI machines: https://github.com/pytorch/torchtitan/actions/runs/14992456541/job/42118899398?pr=1186
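The `AttributeError` in the first case means the current build's `torch.ops.symm_mem` namespace simply does not expose `fused_all_gather_matmul`; accessing a missing op on an op namespace raises `AttributeError`. A quick way to probe for that is the `getattr` pattern sketched below. `op_registered` and the `SimpleNamespace` stand-in are illustrative helpers, not torchtitan or PyTorch API:

```python
# Probe pattern implied by the traceback: looking up a missing op on an op
# namespace raises AttributeError. `op_registered` is a hypothetical helper;
# the SimpleNamespace below stands in for torch.ops.symm_mem in a build
# without the fused symmetric-memory ops.
from types import SimpleNamespace

def op_registered(namespace, name: str) -> bool:
    """Return True if `namespace` exposes an attribute/op called `name`."""
    try:
        getattr(namespace, name)
        return True
    except AttributeError:
        return False

# Stand-in namespace with no fused ops registered.
symm_mem = SimpleNamespace()
print(op_registered(symm_mem, "fused_all_gather_matmul"))  # False
```

Against a real build, passing `torch.ops.symm_mem` as the namespace would show whether the fused all-gather-matmul path that async TP relies on is actually registered before enabling it.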
Versions

- latest PyTorch built from source
- A100 GPUs
- debug model