
Conversation


@SherlockNoMad commented Oct 3, 2025

This is an end-to-end prototype that runs llama3-simplefsdp through the export-style aot_autograd workflow.

Setup: dp_shard = 2, tp = 4 (8 GPUs).
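For orientation, a minimal sketch of the assumed 2x4 device mesh layout using the standard init_device_mesh API. torchtitan builds its own world mesh from the parallelism config, so this is illustrative only:

# Illustrative only: assumes launch under torchrun with 8 ranks;
# torchtitan constructs this mesh internally from the parallelism config.
from torch.distributed.device_mesh import init_device_mesh

world_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard", "tp"))
tp_mesh = world_mesh["tp"]          # 1-D submesh used for tensor parallelism
dp_mesh = world_mesh["dp_shard"]    # 1-D submesh used for FSDP-style sharding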

MVP

  • [Done] Start with a SimpleFSDP model; enable TP + FSDP
  • [Done] Apply aot_export_joint_with_descriptors on the parallelized module with DTensor inputs to get the joint graph
  • [Done] Apply min_cut_partitioner to split the joint graph into forward and backward graph modules
  • [Done, needs verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder/group the communication collectives
  • [Done] Run the joint graph with aot_compile_joint_with_descriptors (see the sketch after this list)
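A minimal sketch of how these steps can be wired together is below. It assumes the aot_export_joint_with_descriptors / aot_compile_joint_with_descriptors entry points in torch._functorch.aot_autograd; the exact keyword arguments, the ExitStack handling, and the placeholder compiler hooks are assumptions for illustration, not the code in this PR.

# Hedged sketch of the export -> partition -> compile flow described above.
import contextlib
import torch
from torch._functorch.aot_autograd import (
    aot_compile_joint_with_descriptors,
    aot_export_joint_with_descriptors,
)

def compile_parallelized_module(parallel_mod: torch.nn.Module, dt_args: tuple):
    # dt_args are DTensor inputs (see the DTensor.from_local conversion below).
    with contextlib.ExitStack() as stack:
        # Trace the joint forward+backward graph with descriptors.
        joint = aot_export_joint_with_descriptors(stack, parallel_mod, dt_args)

        # Partition into fw/bw graph modules and compile; prefetch/bucketing
        # passes would run in the fw/bw compiler hooks below (placeholders here;
        # the keyword names are an assumption).
        compiled_fn = aot_compile_joint_with_descriptors(
            joint,
            fw_compiler=lambda gm, example_inputs: gm,
            bw_compiler=lambda gm, example_inputs: gm,
        )
    return compiled_fn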

Issues

Repro steps:
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

Run with FlexAttention:
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn

Sample output:
P1975157784: rank0_autograd_function_0fea2786.py
P1975158481: rank1_autograd_function_28587623.py

@meta-cla bot added the CLA Signed label Oct 3, 2025
@SherlockNoMad changed the title from Joint Graph Runner to JointGraph-based Training Prototype Oct 3, 2025
@SherlockNoMad marked this pull request as ready for review October 9, 2025
@tianyu-l (Contributor) left a comment


Is this for exploration purposes? If so, I'd suggest we work in a branch / fork.

Comment on lines +296 to +307
# Hack: convert args and kwargs to DTensor. This should be fixed at the data loader.
# This works, but kinda cheating?
dt_args = tuple(
    DTensor.from_local(arg, self.parallel_dims.world_mesh["tp"], [Replicate()])
    for arg in args
)

# RuntimeError('Sharding propagation failed for Op(op=aten.embedding.default, args_schema=Spec(S(0) on (2048, 256)), Spec((Shard(dim=0), Replicate()) on (16, 2048)) @ mesh: (2, 4))')
# dt_args = tuple(DTensor.from_local(arg, self.parallel_dims.world_mesh, [Shard(0), Replicate()]) for arg in args)

# RuntimeError('Sharding propagation failed for Op(op=aten.embedding.default, args_schema=Spec(S(0) on (2048, 256)), Spec(S(0) on (16, 2048)) @ mesh: (2,))')
# dt_args = tuple(DTensor.from_local(arg, self.parallel_dims.world_mesh["dp_shard"], [Shard(0)]) for arg in args)
@SherlockNoMad (Author) replied:

here
