
issues on llama3 compile + (async) TP + AC #1185

@tianyu-l

Description


Bug description

On 8 GPUs, with DP2 TP4:

  1. compile + selective op AC + TP:
     fails with

     ```
     File "/data/users/lty/pytorch/torch/_ops.py", line 1317, in __getattr__
         raise AttributeError(
     AttributeError: '_OpNamespace' 'symm_mem' object has no attribute 'fused_all_gather_matmul'
     ```

Note: DP4 TP2 works.
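The `AttributeError` above indicates that `fused_all_gather_matmul` was never registered in the `symm_mem` op namespace for that build. A minimal probe (a sketch; only the op name comes from the traceback, everything else is generic `torch.ops` behavior) to check whether the fused op is available before enabling async TP:

```python
import torch


def has_fused_all_gather_matmul() -> bool:
    """Return True if torch.ops.symm_mem.fused_all_gather_matmul is registered.

    _OpNamespace.__getattr__ raises AttributeError for unregistered ops,
    which is exactly the failure mode in the traceback above.
    """
    try:
        torch.ops.symm_mem.fused_all_gather_matmul
        return True
    except AttributeError:
        return False


print(has_fused_all_gather_matmul())
```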

  2. compile + selective op AC + async TP:
     very low throughput compared with "full AC + TP"

  3. compile + full AC + async TP:
     emits the following warning and gets very low throughput (compared with "full AC + TP")

     ```
     torch/_inductor/fx_passes/micro_pipeline_tp.py:894] [0/1] no producer matmul found for reduce scatter, skipping fuse_matmul_reduce_scatter fusion
     ```
  4. compile + async TP (+ selective 2 AC) is still failing on CI machines: https://github.com/pytorch/torchtitan/actions/runs/14992456541/job/42118899398?pr=1186
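For reference, the failing combinations map onto a torchtitan-style config along these lines. This is a sketch: the section and key names below are assumptions (they have moved between versions), not taken from this issue.

```toml
# Hedged sketch of the failing combination (compile + selective op AC + TP,
# DP2 TP4 on 8 GPUs); key names are assumptions and may differ by version.
[training]
compile = true

[parallelism]
data_parallel_shard_degree = 2
tensor_parallel_degree = 4
enable_async_tensor_parallel = false  # set true for the async-TP cases

[activation_checkpoint]
mode = "selective"          # "full" for the full-AC case
selective_ac_option = "op"  # per-op selective AC
```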

Versions

- latest PyTorch built from source
- A100 GPUs
- debug model
