
Commit 7781fd7

[TorchTitan][Checkpoint] Add a step-by-step instruction for checkpoint conversion (#235)
1 parent 10b572d commit 7781fd7

1 file changed: +55 -0 lines changed

docs/checkpoint.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# How to Convert a TorchTitan Checkpoint for Use in TorchTune

This guide walks through the steps required to convert a TorchTitan checkpoint so that it can be loaded into TorchTune.

## Steps
1. ENABLE CHECKPOINTING\
In your TorchTitan training config, ensure that `enable_checkpoint` is set to `true`.
```
[checkpoint]
enable_checkpoint = true
folder = "checkpoint"
interval_type = "steps"
interval = 5
```

2. SAVE ONLY MODEL WEIGHTS\
Setting `model_weights_only` to `true` makes the checkpoint contain only the model weights, excluding the optimizer state and extra train states, which results in a smaller checkpoint.
```
[checkpoint]
enable_checkpoint = true
model_weights_only = true
```

3. CHOOSE DESIRED EXPORT PRECISION\
Model states are stored in `float32` by default. You can export the checkpoint in a lower-precision format such as `bfloat16`.
```
[checkpoint]
enable_checkpoint = true
model_weights_only = true
export_dtype = "bfloat16"
```

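To make the size trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the standard widths of 4 bytes per `float32` element and 2 bytes per `bfloat16` element; the parameter count is a hypothetical example, and the helper name is illustrative rather than part of TorchTitan.

```python
# Bytes per element for the two dtypes discussed above (standard widths).
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def weights_size_gib(num_params: int, dtype: str) -> float:
    """Approximate size of the raw model weights in GiB, ignoring file overhead."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


if __name__ == "__main__":
    num_params = 8_000_000_000  # hypothetical 8B-parameter model
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: ~{weights_size_gib(num_params, dtype):.1f} GiB")
```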
4. EXAMPLE CHECKPOINT CONFIGURATION
```
[checkpoint]
enable_checkpoint = true
folder = "checkpoint"
interval_type = "steps"
interval = 5
model_weights_only = true
export_dtype = "bfloat16"
```

5. SAVE THE FINAL CHECKPOINT\
Once the above options are set, the final checkpoint, saved at the last training step, will contain only the model weights in the chosen export dtype. Checkpoints saved before the final step are still full checkpoints, so that training can be resumed from them.

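Because only the final checkpoint is weights-only, the directory to convert is the one with the highest step number. Here is a small sketch for locating it. It assumes checkpoints are stored in subdirectories named `step-<N>`, as in the `step-1000` path used in the conversion command in this guide; the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed directory naming: step-<N>, e.g. step-1000.
STEP_DIR = re.compile(r"step-(\d+)")


def latest_step_dir(checkpoint_root: str) -> Optional[Path]:
    """Return the step-<N> subdirectory with the largest N, or None if there is none."""
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = STEP_DIR.fullmatch(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare numerically so that step-1000 ranks above step-5.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_step_dir("torchtitan/outputs/checkpoint"))
```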
6. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\
Finally, once you have the last checkpoint, use the following command to convert the sharded checkpoint into a single `.pt` file that can be loaded into TorchTune:

```
python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-1000 checkpoint.pt
```

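If the conversion is part of a script, the same command can be assembled and run from Python. This sketch only wraps the command shown above with `subprocess`; the function names are hypothetical, and it assumes PyTorch is installed in the interpreter that runs it.

```python
import subprocess
import sys


def build_convert_command(dcp_dir: str, output_file: str) -> list[str]:
    """Assemble the dcp_to_torch command for the current Python interpreter."""
    return [
        sys.executable, "-m", "torch.distributed.checkpoint.format_utils",
        "dcp_to_torch", dcp_dir, output_file,
    ]


def convert(dcp_dir: str, output_file: str) -> None:
    """Run the conversion; raises CalledProcessError if the command fails."""
    subprocess.run(build_convert_command(dcp_dir, output_file), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute anywhere.
    command = build_convert_command("torchtitan/outputs/checkpoint/step-1000", "checkpoint.pt")
    print(" ".join(command))
```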
That's it. You have now successfully converted a sharded TorchTitan checkpoint for use in TorchTune.
