
[not for land] testing out float8 128_1_128_128 blockwise scaling #1317


Open

vkuzo wants to merge 1 commit into main

Conversation

@vkuzo (Contributor) commented Jun 18, 2025

Summary:

Test drive of pytorch/ao#2386, not for land
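For context, the "128_1_128_128" in the title appears to denote the scaling granularity under test: one scale per 1x128 group for activations along the reduction dimension, and one scale per 128x128 block for weights, the layout consumed by DeepGEMM-style float8 GEMMs (hence the `DeepGemmFloat8Linear` modules in the printout below). The snippet below is a minimal, hypothetical sketch of the weight-side quantization only; it is not the pytorch/ao#2386 implementation, and the function name and scale convention are illustrative.

```python
import torch

BLOCK = 128
F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_weight_blockwise(w: torch.Tensor):
    """Hypothetical sketch: one scale per 128x128 block, then cast to float8_e4m3fn."""
    M, K = w.shape
    assert M % BLOCK == 0 and K % BLOCK == 0, "sketch assumes dims divisible by 128"
    blocks = w.reshape(M // BLOCK, BLOCK, K // BLOCK, BLOCK)
    # choose each block's scale so its absmax maps to the float8 max value
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = F8_MAX / amax
    w_f8 = (blocks * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    # a GEMM kernel would dequantize with 1/scale per 128x128 block
    return w_f8.reshape(M, K), scale.reshape(M // BLOCK, K // BLOCK)

# e.g. the debug model's w1 weight is (out_features=768, in_features=256)
w_f8, scales = quantize_weight_blockwise(torch.randn(768, 256))
print(w_f8.dtype, scales.shape)  # torch.float8_e4m3fn torch.Size([6, 2])
```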

Test Plan:

```bash
with-proxy CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.converters float8 --model.print_after_conversion
```

```
...
[rank0]:      (feed_forward): FeedForward(
[rank0]:        (w1): DeepGemmFloat8Linear(in_features=256, out_features=768, bias=False)
[rank0]:        (w2): DeepGemmFloat8Linear(in_features=768, out_features=256, bias=False)
[rank0]:        (w3): DeepGemmFloat8Linear(in_features=256, out_features=768, bias=False)
[rank0]:      )
[rank0]:      (attention_norm): RMSNorm((256,), eps=1e-05, elementwise_affine=True)
[rank0]:      (ffn_norm): RMSNorm((256,), eps=1e-05, elementwise_affine=True)
[rank0]:    )
[rank0]:  )
[rank0]:  (norm): RMSNorm((256,), eps=1e-05, elementwise_affine=True)
[rank0]:  (output): Linear(in_features=256, out_features=2256, bias=False)
[rank0]:)
[rank0]:
[rank0]:[titan] 2025-06-18 05:44:06,665 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:[titan] 2025-06-18 05:44:06,769 - root - INFO - Model llama3 debugmodel size: 6,270,208 total parameters
[rank0]:[titan] 2025-06-18 05:44:06,769 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-18 05:44:06,785 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - CUDA memory usage for model: 0.00GiB(0.00%)
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - Trainer is initialized with local batch size 8, global batch size 64, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2).
[rank0]:[titan] 2025-06-18 05:44:07,036 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-18 05:44:07,605 - root - INFO - step:  1  loss:  8.2282  memory:  1.25GiB(1.32%)  tps: 19,578  tflops: 1.41  mfu: 0.14%
[rank0]:[titan] 2025-06-18 05:44:07,605 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-18 05:44:07,756 - root - INFO - step:  2  loss:  8.1737  memory:  1.29GiB(1.36%)  tps: 109,161  tflops: 7.85  mfu: 0.79%
[rank0]:[titan] 2025-06-18 05:44:07,898 - root - INFO - step:  3  loss:  8.0697  memory:  1.29GiB(1.36%)  tps: 115,242  tflops: 8.29  mfu: 0.84%
[rank0]:[titan] 2025-06-18 05:44:08,041 - root - INFO - step:  4  loss:  7.9331  memory:  1.29GiB(1.36%)  tps: 114,637  tflops: 8.24  mfu: 0.83%
[rank0]:[titan] 2025-06-18 05:44:08,183 - root - INFO - step:  5  loss:  7.7931  memory:  1.29GiB(1.36%)  tps: 115,818  tflops: 8.33  mfu: 0.84%
[rank0]:[titan] 2025-06-18 05:44:08,327 - root - INFO - step:  6  loss:  7.5978  memory:  1.29GiB(1.36%)  tps: 114,139  tflops: 8.21  mfu: 0.83%
[rank0]:[titan] 2025-06-18 05:44:08,472 - root - INFO - step:  7  loss:  7.4241  memory:  1.29GiB(1.36%)  tps: 112,785  tflops: 8.11  mfu: 0.82%
[rank0]:[titan] 2025-06-18 05:44:08,618 - root - INFO - step:  8  loss:  7.2728  memory:  1.29GiB(1.36%)  tps: 112,992  tflops: 8.12  mfu: 0.82%
[rank0]:[titan] 2025-06-18 05:44:08,757 - root - INFO - step:  9  loss:  7.1301  memory:  1.29GiB(1.36%)  tps: 117,572  tflops: 8.45  mfu: 0.85%
[rank0]:[titan] 2025-06-18 05:44:08,896 - root - INFO - step: 10  loss:  7.0791  memory:  1.29GiB(1.36%)  tps: 118,432  tflops: 8.52  mfu: 0.86%
[rank0]:[titan] 2025-06-18 05:44:08,896 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-18 05:44:10,896 - root - INFO - Training completed
```
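As a quick sanity check on the numbers above, the logged `mfu` matches the achieved `tflops` divided by the printed peak (9.890e+14 FLOPS for this H100). A back-of-the-envelope version, not torchtitan's actual accounting:

```python
# Relating the step-10 log line to the printed peak; illustrative only.
peak_flops = 9.890e14          # "Peak FLOPS used for computing MFU" from the log
achieved_tflops = 8.52         # step 10 "tflops" column
mfu_pct = achieved_tflops * 1e12 / peak_flops * 100
print(f"mfu: {mfu_pct:.2f}%")  # ~0.86%, matching the step-10 "mfu" column
```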

Reviewers:

Subscribers:

Tasks:

Tags:
