
[not for land] testing out float8 128_1_128_128 blockwise scaling #1317


Open

vkuzo wants to merge 1 commit into main

Conversation

@vkuzo (Contributor) commented Jun 18, 2025

Summary:

Test drive of pytorch/ao#2386, not for land
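For context, the "128_1_128_128" in the title appears to denote the scaling granularity under test: one scale per 1x128 group for activations along the reduction dimension, and one scale per 128x128 block for weights, the layout consumed by DeepGEMM-style float8 GEMMs (hence the `DeepGemmFloat8Linear` modules in the printout below). The snippet below is a minimal, hypothetical sketch of the weight-side quantization only; it is not the pytorch/ao#2386 implementation, and the function name and scale convention are illustrative.

```python
import torch

BLOCK = 128
F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_weight_blockwise(w: torch.Tensor):
    """Hypothetical sketch: one scale per 128x128 block, then cast to float8_e4m3fn."""
    M, K = w.shape
    assert M % BLOCK == 0 and K % BLOCK == 0, "sketch assumes dims divisible by 128"
    blocks = w.reshape(M // BLOCK, BLOCK, K // BLOCK, BLOCK)
    # choose each block's scale so its absmax maps to the float8 max value
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = F8_MAX / amax
    w_f8 = (blocks * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    # a GEMM kernel would dequantize with 1/scale per 128x128 block
    return w_f8.reshape(M, K), scale.reshape(M // BLOCK, K // BLOCK)

# e.g. the debug model's w1 weight is (out_features=768, in_features=256)
w_f8, scales = quantize_weight_blockwise(torch.randn(768, 256))
print(w_f8.dtype, scales.shape)  # torch.float8_e4m3fn torch.Size([6, 2])
```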

Test Plan:

```bash
with-proxy CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.converters float8 --model.print_after_conversion
```

```
...
[rank0]:      (feed_forward): FeedForward(
[rank0]:        (w1): DeepGemmFloat8Linear(in_features=256, out_features=768, bias=False)
[rank0]:        (w2): DeepGemmFloat8Linear(in_features=768, out_features=256, bias=False)
[rank0]:        (w3): DeepGemmFloat8Linear(in_features=256, out_features=768, bias=False)
[rank0]:      )
[rank0]:      (attention_norm): RMSNorm((256,), eps=1e-05, elementwise_affine=True)
[rank0]:      (ffn_norm): RMSNorm((256,), eps=1e-05, elementwise_affine=True)
[rank0]:    )
[rank0]:  )
[rank0]:  (norm): RMSNorm((256,), eps=1e-05, elementwise_affine=True)
[rank0]:  (output): Linear(in_features=256, out_features=2256, bias=False)
[rank0]:)
[rank0]:
[rank0]:[titan] 2025-06-18 05:44:06,665 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:[titan] 2025-06-18 05:44:06,769 - root - INFO - Model llama3 debugmodel size: 6,270,208 total parameters
[rank0]:[titan] 2025-06-18 05:44:06,769 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-18 05:44:06,785 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - CUDA memory usage for model: 0.00GiB(0.00%)
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-06-18 05:44:07,035 - root - INFO - Trainer is initialized with local batch size 8, global batch size 64, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2).
[rank0]:[titan] 2025-06-18 05:44:07,036 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-18 05:44:07,605 - root - INFO - step:  1  loss:  8.2282  memory:  1.25GiB(1.32%)  tps: 19,578  tflops: 1.41  mfu: 0.14%
[rank0]:[titan] 2025-06-18 05:44:07,605 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-18 05:44:07,756 - root - INFO - step:  2  loss:  8.1737  memory:  1.29GiB(1.36%)  tps: 109,161  tflops: 7.85  mfu: 0.79%
[rank0]:[titan] 2025-06-18 05:44:07,898 - root - INFO - step:  3  loss:  8.0697  memory:  1.29GiB(1.36%)  tps: 115,242  tflops: 8.29  mfu: 0.84%
[rank0]:[titan] 2025-06-18 05:44:08,041 - root - INFO - step:  4  loss:  7.9331  memory:  1.29GiB(1.36%)  tps: 114,637  tflops: 8.24  mfu: 0.83%
[rank0]:[titan] 2025-06-18 05:44:08,183 - root - INFO - step:  5  loss:  7.7931  memory:  1.29GiB(1.36%)  tps: 115,818  tflops: 8.33  mfu: 0.84%
[rank0]:[titan] 2025-06-18 05:44:08,327 - root - INFO - step:  6  loss:  7.5978  memory:  1.29GiB(1.36%)  tps: 114,139  tflops: 8.21  mfu: 0.83%
[rank0]:[titan] 2025-06-18 05:44:08,472 - root - INFO - step:  7  loss:  7.4241  memory:  1.29GiB(1.36%)  tps: 112,785  tflops: 8.11  mfu: 0.82%
[rank0]:[titan] 2025-06-18 05:44:08,618 - root - INFO - step:  8  loss:  7.2728  memory:  1.29GiB(1.36%)  tps: 112,992  tflops: 8.12  mfu: 0.82%
[rank0]:[titan] 2025-06-18 05:44:08,757 - root - INFO - step:  9  loss:  7.1301  memory:  1.29GiB(1.36%)  tps: 117,572  tflops: 8.45  mfu: 0.85%
[rank0]:[titan] 2025-06-18 05:44:08,896 - root - INFO - step: 10  loss:  7.0791  memory:  1.29GiB(1.36%)  tps: 118,432  tflops: 8.52  mfu: 0.86%
[rank0]:[titan] 2025-06-18 05:44:08,896 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-18 05:44:10,896 - root - INFO - Training completed
```
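As a quick sanity check on the numbers above, the logged `mfu` matches the achieved `tflops` divided by the printed peak (9.890e+14 FLOPS for this H100). A back-of-the-envelope version, not torchtitan's actual accounting:

```python
# Relating the step-10 log line to the printed peak; illustrative only.
peak_flops = 9.890e14          # "Peak FLOPS used for computing MFU" from the log
achieved_tflops = 8.52         # step 10 "tflops" column
mfu_pct = achieved_tflops * 1e12 / peak_flops * 100
print(f"mfu: {mfu_pct:.2f}%")  # ~0.86%, matching the step-10 "mfu" column
```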

Reviewers:

Subscribers:

Tasks:

Tags:
