Add Sequence Parallelism to llama #32
Conversation
    parallelize_plan=layer_plan,
)

rank0_log(f"Applied Sequence Parallelism to the model...")
I wonder if it's useful to log more info about the SP plan. I was thinking about this for PP too: what info do we want to print? Should each parallelism print its own summary, or should we have one overall function that prints the overall parallel info in a unified way?
🤔 That's a good point. I think we should probably log the parallelize plan for SP. This would require some changes in PyTorch to add `__str__` to our ParallelStyles; I can add the log once the PyTorch PR is merged.

> Should each parallelism print its own summary, or should we have one overall function that prints overall parallel info in a unified way?

My two cents: it's a bit tricky to give an overall summary. I think we should first figure out how to print the intended summary for each parallelism. For example, when too many TransformerBlocks are stacked we can't log/print every layer's parallel plan, so maybe we print the PP degree of TransformerBlocks, and we might not want to print the SP plan for each PP TransformerBlock.
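As a strawman for the per-parallelism summary, a helper along these lines could log the SP plan today without waiting on `__str__` for the ParallelStyles; `rank0_log` is the logging helper already used in this diff, the rest is illustrative only:

```python
def log_parallelize_plan(plan, prefix="SP"):
    # Illustrative sketch: one line per module pattern, using the class name
    # of its ParallelStyle instead of a (not-yet-existing) __str__.
    for fqn, style in plan.items():
        rank0_log(f"{prefix} plan: {fqn} -> {type(style).__name__}")


# Log the block-level plan once, rather than once per stacked TransformerBlock.
log_parallelize_plan(layer_plan)
```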
Looks great to me! One inline question:
distribute_rmsnorm(transformer_block.attention_norm, tp_mesh)
distribute_rmsnorm(transformer_block.ffn_norm, tp_mesh)
Shall we also apply it to the final norm after all the transformer blocks?
Not something currently enabled, but I think we can explore this in real training and see if sharding the final norm would give additional memory/perf benefits :)
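For reference, if we do try it, the change would likely be a one-liner mirroring the calls above, assuming the final norm is exposed as `model.norm` (the attribute name is an assumption):

```python
# Hypothetical: also shard the final norm after the last TransformerBlock,
# reusing the same helper applied to attention_norm / ffn_norm above.
distribute_rmsnorm(model.norm, tp_mesh)
```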
* Add other filtering options beyond group
* Address review comments
Stack from ghstack (oldest at bottom):

Somehow torch.compile is not working even though eager sequence parallelism works, so currently we don't turn it on by default.
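Until the compile failure is root-caused, one way to keep `torch.compile` on for the non-SP path is to gate it on the parallelism setting; the flag names below (`enable_compile`, `sequence_parallel`) are placeholders, not actual job config fields:

```python
# Placeholder flags: keep torch.compile enabled when SP is off, since eager
# sequence parallelism works but the compiled path currently fails.
if enable_compile and not sequence_parallel:
    model = torch.compile(model)
```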