README.md: 5 additions & 5 deletions
@@ -43,18 +43,18 @@ Currently we showcase pre-training **Llama 3 and Llama 2** LLMs of various sizes
6. Learning rate scheduler, meta init, Optional Fused RMSNorm
7. All options easily configured via [toml files](train_configs/)
8. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
9. [Float8 support](docs/float8.md)

We report our [Performance](docs/performance.md) verified on 64 A100 GPUs

* `--float8.enable_float8_linear`: swap `nn.Linear` with `Float8Linear` to perform float8 matmul.
* `--float8.enable_fsdp_float8_all_gather`: cast `Float8Linear.weight` from high precision to float8 before the FSDP all-gather so we can communicate in float8 to save bandwidth.
* `--float8.precompute_float8_dynamic_scale_for_fsdp` (optional): communicate AMAX/scales efficiently in a single all-reduce for all parameters instead of doing many small all-reduces, one per parameter.
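
Since all options are configured via the [toml files](train_configs/) and the dotted `--float8.*` flags above, a config enabling all three options might look like the following sketch. The `[float8]` section and key names are assumed to mirror the flag names; check the shipped configs for the exact layout.

```toml
# Hypothetical train config excerpt: section/key names assumed to mirror
# the --float8.* command-line flags documented above.
[float8]
enable_float8_linear = true                      # swap nn.Linear with Float8Linear
enable_fsdp_float8_all_gather = true             # all-gather weights in float8
precompute_float8_dynamic_scale_for_fsdp = true  # single all-reduce for AMAX/scales
```

The same options can equivalently be passed on the command line, e.g. `--float8.enable_float8_linear`.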
For parallelisms, we support float8 all-gather for FSDP (optional) and for TP (by default for `Float8Linear`).
For scaling strategy, we currently support tensor-wise scaling with dynamic scales, and are actively working on tensor-wise scaling with delayed scales. Row-wise scaling is under exploration.
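
As a point of reference (generic notation, not torchtitan-specific), tensor-wise dynamic scaling derives one scale per tensor from its current absolute maximum at every iteration, whereas delayed scaling reuses an amax history from earlier iterations. Roughly:

```math
s = \frac{\mathrm{FP8}_{\max}}{\mathrm{amax}(|x|)}, \qquad x_{\mathrm{fp8}} = \operatorname{cast}_{\mathrm{fp8}}(x \cdot s)
```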