Closed
Changes from all commits
151 commits
054f088
check in tokenizer.model for ease of dev setup (#59)
wanchaol Feb 13, 2024
bfe2b58
Add truncated llama style model init via reset parameters() (#54)
lessw2020 Feb 14, 2024
60f021a
add model num params display, gpu memory metrics (#56)
lessw2020 Feb 15, 2024
ad69e62
add TensorBoard logging with loss and wps
tianyu-l Feb 15, 2024
a4663b1
add memory metrics to TensorBoard
tianyu-l Feb 17, 2024
2daf53f
modify data split to use HF api
tianyu-l Feb 21, 2024
50d69f6
add multinode support via slurm trainer, large scale race condition f…
lessw2020 Feb 22, 2024
8ad4dcb
add configurable unique layer init, clean up lr and loss display (#64)
lessw2020 Feb 22, 2024
8097c26
add bunch of cleanups and design principle section (#71)
wanchaol Feb 23, 2024
28f431f
delete the linter to see if re-adding it helps (#80)
wconstab Feb 23, 2024
ebbb1cb
Unified config manager for toml and command line (#76)
gnadathur Feb 24, 2024
bccad90
Whc/add linter (#81)
wconstab Feb 24, 2024
ab75dbd
Add 4GPU unit test (#82)
wconstab Feb 24, 2024
468ce8f
update readme (#74)
wanchaol Feb 24, 2024
3fce6bb
move config folder to root and adjust options (#83)
wanchaol Feb 24, 2024
3b48039
add iter time tracking via cuda events, add data loading times, add c…
lessw2020 Feb 26, 2024
df77f4e
Fill missing options in toml file with argparse defaults (#91)
gnadathur Feb 26, 2024
325951f
support infinite loop over alpaca dataset
tianyu-l Feb 26, 2024
b12b6dd
Add color to console output if local logging, auto avoid color loggin…
lessw2020 Feb 27, 2024
254279f
update GPU metrics logging to GiB (gibibytes) (#95)
lessw2020 Feb 27, 2024
4c03475
improve TensorBoard instructions in README
tianyu-l Feb 27, 2024
7ea0679
Enable libUV for torchtrain (#98)
gnadathur Feb 28, 2024
e60c573
use warmup steps for lr scheduler, ban steps == -1 (#99)
wanchaol Feb 29, 2024
900b215
Add llama 7B config (#100)
wanchaol Feb 29, 2024
6e87471
add selective activation checkpointing
tianyu-l Feb 29, 2024
1b343f2
Add job description field in toml (#101)
gnadathur Mar 1, 2024
42f8907
fix 2D parallel crash caused by all-reduce on 2D world_mesh
tianyu-l Mar 2, 2024
4042b05
Load missing keys default from argparse (#111)
gnadathur Mar 5, 2024
6529af1
Add meta_init, enable it as default init process (#84)
lessw2020 Mar 5, 2024
5f0eaea
Fix feedback from PR 111 (#113)
gnadathur Mar 5, 2024
1ce8188
fix SP minor issues
tianyu-l Mar 5, 2024
bb5c4c6
enable loss parallel in SP
tianyu-l Mar 6, 2024
f31adb0
Float8_experimental option for training (#102)
drisspg Mar 6, 2024
6927e45
add miniPile dataset for pretraining, 1M entries (solves the 'out of …
lessw2020 Mar 7, 2024
d902a47
add data loading option to load from local file system
tianyu-l Mar 7, 2024
422910b
add llama 13B configs
wanchaol Mar 9, 2024
af221ce
add llama 70B toml
wanchaol Mar 9, 2024
5e36c74
set betas and weight decay for optimizers
wanchaol Mar 9, 2024
08b332c
Add c4 dataset (177M, streaming), update multi-node support for lates…
lessw2020 Mar 9, 2024
1d11cf5
Add openwebtext dataset for larger scale training without shuffling (…
lessw2020 Mar 12, 2024
2722865
[TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading…
wz337 Mar 12, 2024
2369861
improve logging
tianyu-l Mar 13, 2024
3262a8b
use SequenceParallel style in tp/sp (#133)
wanchaol Mar 13, 2024
d9253ee
support TP-only parallelism
tianyu-l Mar 13, 2024
b42ce91
disable verbose print from profiling
tianyu-l Mar 13, 2024
3ac610b
add Selective layer activation checkpointing, single control for tur…
lessw2020 Mar 14, 2024
af56ae0
remove per iter synchronize
tianyu-l Mar 14, 2024
073909b
Shorten nccl comm timeout and enable flight recorder dumping (#103)
wconstab Mar 15, 2024
e3204c6
fix up gpu memory monitoring and logging
tianyu-l Mar 15, 2024
a257bc3
Separate timeout during init and training (#149)
wconstab Mar 15, 2024
1d6100c
Update activation check with updates to config manager (#152)
drisspg Mar 20, 2024
ae9a966
Refactor to clean up parallelisms/__init__.py
wconstab Mar 20, 2024
47bb509
enable gc control scheduling to help avoid stragglers (#148)
lessw2020 Mar 20, 2024
fcca670
Add float8 specific parallel strategies (#153)
drisspg Mar 20, 2024
5d28009
add MFU to metrics
tianyu-l Mar 20, 2024
35d881e
disable buffer reuse for compile for now (#156)
wanchaol Mar 21, 2024
f080027
refactor config manager and support cmd overrides (#157)
wanchaol Mar 22, 2024
34732f5
Add support for generating debug traces on failure
chauhang Mar 24, 2024
e008027
rename sequence_parallel to tensor_parallel (#162)
wanchaol Mar 25, 2024
44808f9
add basic AC configs for 13B and 70B (#169)
wanchaol Mar 27, 2024
bb61af0
[TorchTrain][Checkpoint] Update train state to include global_avg_los…
wz337 Mar 27, 2024
6500bc6
Basic integration test infra (#170)
gnadathur Mar 27, 2024
479694f
Add 2D integration test (FSDP + TP) (#171)
gnadathur Mar 27, 2024
02923f0
Used per-parameter FSDP (#165)
awgu Mar 28, 2024
615f9c1
plot losses in loaded TrainState to TensorBoard
tianyu-l Mar 28, 2024
b1349da
Removed setting global flag for `swap_tensors` since not needed anymore
Mar 29, 2024
65f0297
Add integration test with compile enabled (#183)
gnadathur Apr 2, 2024
e1e17c9
remove folding and unfolding of sequence dim in model.py
tianyu-l Apr 3, 2024
b9a4548
bump comm.train_timeout_seconds (#189)
wanchaol Apr 4, 2024
3686897
fix checkpoint parser
wz337 Apr 5, 2024
7872248
support sequence of tests and add checkpoint test
wz337 Apr 5, 2024
5ac3aa6
Make freqs_cis a persistent buffer for pp init
wconstab Apr 5, 2024
5379282
Delete grad scaler, which is unsupported/unused
wconstab Apr 5, 2024
d8e64cc
Factor out loss_fn to share code with pipeline par
wconstab Apr 5, 2024
0397fef
[TorchTrain] Minor fix for #197 (#204)
wz337 Apr 5, 2024
cd1e5e8
Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable…
lessw2020 Apr 5, 2024
f795361
remove .item() per iter
tianyu-l Apr 5, 2024
946780a
Removed cache_k and cache_v comments
Apr 10, 2024
18adb2f
Some more cleanups
Apr 10, 2024
ef4c5d2
avoid record streams and make color printing a config
tianyu-l Apr 10, 2024
6629659
fix SAC to use the correct reduce_scatter op (#215)
wanchaol Apr 10, 2024
ddf916e
Test runner raises exception on failures (#216)
gnadathur Apr 10, 2024
ecdbacc
Revert "Separate TransformerEmbedding layer (#33)"
wconstab Apr 10, 2024
656be68
Fix 2DParallel test (#219)
gnadathur Apr 10, 2024
97fe9a4
Added initial FSDP readme
Apr 10, 2024
ce05f65
[TorchTrain][Checkpoint] Add model_weights_only option to train_confi…
wz337 Apr 11, 2024
00293cb
Rename to torchtitan (#221)
wanchaol Apr 11, 2024
c08f617
[TorchTitan] Add destroy process group at the end of training (#223)
wz337 Apr 12, 2024
7712f72
Add 1 sec delay to rank 0 cleanup (#224)
gnadathur Apr 12, 2024
71621a2
[Torchtrain][Checkpoint] Add support to allow dtype conversion (#222)
wz337 Apr 12, 2024
5aa0aec
[TorchTitan] Remove checkpoint folder at the end in test_runner.py (#…
wz337 Apr 12, 2024
cb24eb5
codebase cleanup
tianyu-l Apr 15, 2024
3cfdbf2
Update README to reflect positioning (#229)
wanchaol Apr 16, 2024
db04c7e
First release readme (#227)
lessw2020 Apr 16, 2024
f504816
Update licenses and headers (#231)
wanchaol Apr 16, 2024
41fb267
use permalink for logo image (#232)
lessw2020 Apr 16, 2024
82c2518
[TorchTitan][Checkpoint] Move checkpoint folder under dump_folder and…
wz337 Apr 16, 2024
d42a7d1
use combo of html and local file src for logo (#234)
lessw2020 Apr 16, 2024
80103a9
add performance -- infra metrics and loss curves (#237) (#238)
lessw2020 Apr 16, 2024
09e7bec
add license section in readme (#239)
wanchaol Apr 16, 2024
22aa488
[TorchTitan][Checkpoint] Add a step-by-step instruction for checkpoin…
wz337 Apr 16, 2024
81138d6
more license headers (#240)
wanchaol Apr 16, 2024
04f5b82
Update README (#242)
wanchaol Apr 16, 2024
4f6ed9a
Add torchtune checkpoint link, modify product position statement loca…
lessw2020 Apr 16, 2024
cd55a38
Add pyproject and upgrade version (#236)
wanchaol Apr 16, 2024
78b843b
minor doc updates - remove asynch checkpt ref, grammar on prod positi…
lessw2020 Apr 16, 2024
ce0fff0
Fix multi-line string usage (#244)
gnadathur Apr 16, 2024
7b353c8
polish toml files
tianyu-l Apr 16, 2024
bc7fec5
[torchtitan][checkpoint][doc] Minor fix checkpoint doc (#246)
wz337 Apr 16, 2024
a682505
fix default max_seq_len for freq_cis init (#248)
wanchaol Apr 17, 2024
1ea4dee
set max_seq_len before training to make it align with input data (#249)
wanchaol Apr 17, 2024
55c8e48
fix pypi docs
tianyu-l Apr 17, 2024
fd9b498
update dataset to use c4
tianyu-l Apr 18, 2024
978c5c6
Add c4_mini, a local 45K dataset (subset of c4) (#253)
lessw2020 Apr 18, 2024
4020e92
remove logo, update pre-release date to 4/18 (#254)
lessw2020 Apr 18, 2024
51a6f6f
add intro video (#233)
lessw2020 Apr 18, 2024
6aafe3c
add performance file to show convergence with 64 a100s (#255)
lessw2020 Apr 18, 2024
35470ca
Support Llama3 8b/70b (#256)
wanchaol Apr 20, 2024
960e70f
polish llama 3 setup
tianyu-l Apr 22, 2024
e1c116a
reenable integration tests with a test tokenizer (#259)
wanchaol Apr 23, 2024
be432e1
warn supported dataset checks instead of throw (#260)
wanchaol Apr 24, 2024
192ed48
De-dup repeated `freqs_cis` computation code
Apr 24, 2024
f38766e
update readme.md and performance.md
tianyu-l Apr 24, 2024
0eacbae
followup changes to allow unsupported datasets
tianyu-l Apr 24, 2024
217cc94
fix ac 'checkpointing' spelling, minor spacing tweaks (#265)
lessw2020 Apr 24, 2024
e3b47ea
Update legal terms (#269)
lessw2020 Apr 25, 2024
eed7495
apply less heavy profiling
tianyu-l Apr 25, 2024
3393c2a
Showcase where the product positioning lies more clearly (#272)
soumith Apr 25, 2024
568dad6
Doc Fixes (#273)
msaroufim Apr 25, 2024
3e13e24
fix lr scheduling by checkpointing scheduler
tianyu-l Apr 26, 2024
f03c128
insert barrier to profiler to resolve collectives timeout
tianyu-l Apr 25, 2024
42549a9
some misc changes (#278)
wanchaol Apr 26, 2024
0d09a32
inherit stateful protocol where appropriate
tianyu-l Apr 26, 2024
06da6c2
Fixed docs on HSDP sharding/replication dims
Apr 29, 2024
a843abf
Add more Float8 description (#284)
drisspg Apr 29, 2024
d442743
Remove unneeded torchvision/audio deps
wconstab Apr 29, 2024
e7f2d28
fix 3d mesh order (#288)
wanchaol Apr 30, 2024
4e5ffaf
unify data loading from HF and from disk
tianyu-l Apr 30, 2024
58b1169
Add periodic integration test with signal (#289)
gnadathur May 1, 2024
4d8c245
exclude embedding in MFU computation
tianyu-l Apr 26, 2024
17cda29
Add support for seed checkpoint creation for meta-init flow
wconstab May 2, 2024
1a6caf2
remove unnecessary install of torchtitan
tianyu-l May 2, 2024
787a571
Remove unnecessary .to() inside model forward
wconstab May 2, 2024
695bd01
Fix the incorrect step log for profiler after resuming from a checkpo…
fegin May 3, 2024
143b586
turn off dynamic shape for torch.compile (#297)
wanchaol May 3, 2024
f72a2a0
Renamed `bsz` to `bs` for consistency; removed dead code
May 3, 2024
3295448
Implement async_checkpoint
fegin May 7, 2024
f5a3ad7
simplify embedding + first transformer block TP (#314)
wanchaol May 8, 2024
a08a70b
selective compilation - norm layers only
lessw2020 May 10, 2024
02cc5c4
lint
lessw2020 May 10, 2024
f249e26
update config mgr and other tomls
lessw2020 May 10, 2024
3 changes: 2 additions & 1 deletion .flake8
@@ -7,8 +7,9 @@ max-line-length = 120
# N812 ignored because import torch.nn.functional as F is PyTorch convention
# N817 ignored because importing using acronyms is convention (DistributedDataParallel as DDP)
# E731 allow usage of assigning lambda expressions
# N803,N806 allow caps and mixed case in function params. This is to work with Triton kernel coding style.
ignore =
E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,N812,N817,E731
E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,N812,N817,E731,N803,N806
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
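For context on the new N803/N806 entries: Triton-style kernels conventionally use upper-case names for compile-time parameters and locals, which pep8-naming would otherwise flag. A minimal plain-Python illustration of the naming pattern (illustrative only; not code from this PR):

```python
# Illustrative only: Triton kernels typically name compile-time parameters in caps
# (e.g. BLOCK_SIZE), which N803 (argument should be lowercase) and
# N806 (variable in function should be lowercase) would otherwise flag.
def vector_add(x, y, N, BLOCK_SIZE=1024):            # N803 would fire on N and BLOCK_SIZE
    NUM_BLOCKS = (N + BLOCK_SIZE - 1) // BLOCK_SIZE  # N806 would fire on NUM_BLOCKS
    out = [x[i] + y[i] for i in range(N)]
    return out, NUM_BLOCKS
```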
42 changes: 42 additions & 0 deletions .github/workflows/integration_test_periodic.yaml
@@ -0,0 +1,42 @@
name: GPU Integration Test

on:
schedule:
# Runs hourly
- cron: '0 * * * *'

concurrency:
group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
cancel-in-progress: true

defaults:
run:
shell: bash -l -eo pipefail {0}

jobs:
unit_tests_4gpu:
runs-on: linux.g5.12xlarge.nvidia.gpu
strategy:
matrix:
python-version: ['3.10']
steps:
- name: Check out repo
uses: actions/checkout@v3
- name: Setup conda env
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
miniconda-version: "latest"
activate-environment: test
python-version: ${{ matrix.python-version }}
- name: Update pip
run: python -m pip install --upgrade pip
- name: Install dependencies
run: |
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
python -m pip install -r requirements.txt
python -m pip install -r dev-requirements.txt
- name: Run test_runner.py
run: python ./test_runner.py
- name: Upload Coverage to Codecov
uses: codecov/codecov-action@v3
12 changes: 5 additions & 7 deletions .github/workflows/lint.yaml
@@ -16,7 +16,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.11']
python-version: ['3.10']
steps:
- name: Check out repo
uses: actions/checkout@v3
@@ -30,10 +30,8 @@ jobs:
run: |
python -m pip install pre-commit
pre-commit install-hooks
- id: file_changes
uses: trilom/[email protected]
with:
prNumber: ${{ github.event.number }}
output: ' '
- name: Get changed files
id: changed-files
uses: tj-actions/[email protected]
- name: Lint modified files
run: pre-commit run --files ${{ steps.file_changes.outputs.files }}
run: pre-commit run --files ${{ steps.changed-files.outputs.all_changed_files }}
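For contributors who want to approximate this CI step locally, the job is essentially "run pre-commit on the files changed relative to main". A sketch, assuming `git` and `pre-commit` are available (not part of the workflow itself):

```python
# Rough local equivalent of the lint job: lint only files changed vs. origin/main.
import subprocess

changed = subprocess.run(
    ["git", "diff", "--name-only", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout.split()

if changed:
    subprocess.run(["pre-commit", "run", "--files", *changed], check=False)
else:
    print("no changed files to lint")
```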
42 changes: 42 additions & 0 deletions .github/workflows/unit_test_4gpu.yaml
@@ -0,0 +1,42 @@
name: 4 GPU Unit Test

on:
push:
branches: [ main ]
pull_request:

concurrency:
group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
cancel-in-progress: true

defaults:
run:
shell: bash -l -eo pipefail {0}

jobs:
unit_tests_4gpu:
runs-on: linux.g5.12xlarge.nvidia.gpu
strategy:
matrix:
python-version: ['3.10']
steps:
- name: Check out repo
uses: actions/checkout@v3
- name: Setup conda env
uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
miniconda-version: "latest"
activate-environment: test
python-version: ${{ matrix.python-version }}
- name: Update pip
run: python -m pip install --upgrade pip
- name: Install dependencies
run: |
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
python -m pip install -r requirements.txt
python -m pip install -r dev-requirements.txt
- name: Run test_runner.py
run: python ./test_runner.py
- name: Upload Coverage to Codecov
uses: codecov/codecov-action@v3
@@ -1,4 +1,4 @@
name: Unit Test
name: CPU Unit Test

on:
push:
@@ -14,7 +14,7 @@ defaults:
shell: bash -l -eo pipefail {0}

jobs:
unit_tests:
cpu_unit_tests:
runs-on: ubuntu-latest
strategy:
matrix:
@@ -33,10 +33,9 @@ jobs:
run: python -m pip install --upgrade pip
- name: Install dependencies
run: |
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
python -m pip install -r requirements.txt
python -m pip install -r dev-requirements.txt
python -m pip install -e .
- name: Run unit tests with coverage
run: pytest test --cov=. --cov-report=xml --durations=20 -vv
- name: Upload Coverage to Codecov
5 changes: 3 additions & 2 deletions .gitignore
@@ -4,10 +4,11 @@ __pycache__
*.egg-info
build
outputs
dist/*

# data
data
out
wandb
*.model
*.json

torchtitan/datasets/**/*.model
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,4 +1,4 @@
# Contributing to torchtrain
# Contributing to torchtitan
We want to make contributing to this project as easy and transparent as
possible.

@@ -28,5 +28,5 @@ disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License
By contributing to `torchtrain`, you agree that your contributions will be licensed
By contributing to `torchtitan`, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
28 changes: 28 additions & 0 deletions LICENSE
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright 2024 Meta

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,this list
of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may
be used to endorse or promote products derived from this software without specific
prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
123 changes: 112 additions & 11 deletions README.md
@@ -1,26 +1,127 @@
# torchtrain
[![GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/unit_test_4gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/unit_test_4gpu.yaml)

Note: This repository is currently under heavy development.
# torchtitan

torchtrain contains PyTorch native parallelisms, tools and utilities to train large models.
`torchtitan` is currently in a pre-release state and under extensive development.

# Installation
`torchtitan` is a proof-of-concept for large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. torchtitan is complementary to and not a replacement for any of the great large-scale LLM training codebases such as Megatron, MegaBlocks, LLM Foundry, DeepSpeed, etc. Instead, we hope that the features showcased in torchtitan will be adopted by these codebases quickly. torchtitan is unlikely to ever grow a large community around it.

install PyTorch from source or install the latest pytorch nightly, then install requirements by
Our guiding principles when building `torchtitan`:

```python
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D, 2D, or (soon) 3D Parallel.
* Modular components instead of a monolithic codebase.
* Get started in minutes, not hours!

### Intro video - learn more about torchtitan in under 4 mins:

[![Welcome to torchtitan!](assets/images/titan_play_video.png)](https://youtu.be/ee5DOEqD35I?si=_B94PbVv0V5ZnNKE "Welcome to torchtitan!")

## Pre-Release Updates:
#### (4/25/2024): `torchtitan` is now public but in a pre-release state and under development.
Currently we showcase pre-training **Llama 3 and Llama 2** LLMs of various sizes from scratch. `torchtitan` is tested and verified with the PyTorch nightly version `torch-2.4.0.dev20240412`. (We recommend latest PyTorch nightly).

### Key features available

1. [FSDP2 with per param sharding](docs/fsdp.md)
2. [Tensor Parallel](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)
3. Selective layer and operator activation checkpointing
4. Distributed checkpointing
5. 2 datasets pre-configured (45K - 144M)
6. GPU usage, MFU, tokens per second and more displayed via TensorBoard (see the MFU sketch below)
7. Learning rate scheduler, meta init, optional Fused RMSNorm
8. All options easily configured via [toml files](train_configs/)
9. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine tuning

We report our [Performance](docs/performance.md), verified on 64 A100 GPUs.
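As a rough guide to how an MFU number like the one surfaced in TensorBoard is derived, here is the standard 6·N·tokens/sec approximation (torchtitan's exact accounting, e.g. excluding embeddings, may differ):

```python
# Standard MFU approximation: training FLOPs/token ~ 6 * num_params,
# so MFU = achieved FLOPs/sec divided by the hardware peak.
def mfu(num_params: int, tokens_per_sec: float, peak_flops: float = 312e12) -> float:
    # 312 TFLOPS is the A100 BF16 dense peak; adjust for other hardware.
    return (6 * num_params * tokens_per_sec) / peak_flops

# e.g. a 7B-parameter model at 3,000 tokens/sec per A100 -> ~40% MFU
print(f"{mfu(7_000_000_000, 3_000) * 100:.1f}% MFU")
```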


### Coming soon
1. Async checkpointing
2. FP8 support
3. Context Parallel
4. 3D Pipeline Parallel
5. `torch.compile` support
6. Scalable data loading solution


## Installation

```bash
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118
```

### Downloading a tokenizer

`torchtitan` currently supports training Llama 3 (8B, 70B), and Llama 2 (7B, 13B, 70B) out of the box. To get started training these models, we need to download a tokenizer.model. Follow the instructions on the official [meta-llama](https://huggingface.co/meta-llama/Meta-Llama-3-8B) repository to ensure you have access to the Llama model weights.

Once you have confirmed access, you can run the following command to download the Llama 3 / Llama 2 tokenizer to your local machine.

```bash
# Get your HF token from https://huggingface.co/settings/tokens

# llama3 tokenizer.model
python torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3-8B --tokenizer_path "original" --hf_token=...

# llama2 tokenizer.model
python torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Llama-2-13b-hf --hf_token=...
```
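For reference, a download helper like the one invoked above is essentially a thin wrapper over `huggingface_hub`. A minimal sketch, assuming the file lives at `<tokenizer_path>/tokenizer.model` in the gated repo (the actual `download_tokenizer.py` may be organized differently):

```python
# Minimal sketch of a tokenizer downloader; the real download_tokenizer.py may differ.
from huggingface_hub import hf_hub_download

def download_tokenizer(repo_id: str, hf_token: str, tokenizer_path: str = "") -> str:
    # e.g. repo_id="meta-llama/Meta-Llama-3-8B", tokenizer_path="original"
    filename = f"{tokenizer_path}/tokenizer.model" if tokenizer_path else "tokenizer.model"
    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir="torchtitan/datasets/tokenizer",  # assumed target folder
        token=hf_token,
    )

if __name__ == "__main__":
    print(download_tokenizer("meta-llama/Meta-Llama-3-8B", hf_token="hf_...", tokenizer_path="original"))
```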

### Start a training run
To train the Llama 3 8B model locally on 8 GPUs:

```bash
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
```


## TensorBoard

To visualize TensorBoard metrics of models trained on a remote server via a local web browser:

1. Make sure the `metrics.enable_tensorboard` option is set to true for the training run (either in the .toml file or from the CLI).

2. Set up SSH tunneling by running the following from your local CLI:
```
ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
```

3. In the SSH session on the remote server, go to the torchtitan repo and start the TensorBoard backend:
```
tensorboard --logdir=./outputs/tb
```

4. In your local web browser, go to the URL it provides or to http://localhost:6006/.
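For orientation, the metrics that show up under `./outputs/tb` (loss, wps, MFU, memory) are written with a standard `SummaryWriter`. A minimal sketch of that pattern, with placeholder tag names rather than torchtitan's actual ones:

```python
# Minimal sketch of TensorBoard metric logging; torchtitan's metric logger may differ.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./outputs/tb")
for step in range(1, 11):
    loss = 10.0 / step   # placeholder standing in for the real training loss
    wps = 3000.0         # placeholder tokens-per-second measurement
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/wps", wps, step)
writer.close()
```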


download tokenizer from HF
This part is needed first time if there's no tokenizer locally by run:
## Multi-Node Training
For training on ParallelCluster/Slurm type configurations, you can use the `multinode_trainer.slurm` file to submit your sbatch job.

To get started, adjust the number of nodes and GPUs:
```
python torchtrain/datasets/download_tokenizer.py --hf_token your_token
#SBATCH --ntasks=2
#SBATCH --nodes=2
```

run the llama debug model locally to verify the setup is correct:
Then start a run where `nnodes` is your total node count, matching the sbatch node count above.

```
./run_llama_train.sh
srun torchrun --nnodes 2
```

If your GPU count per node is not 8, adjust:

```--nproc_per_node```

in the torchrun command and

```#SBATCH --gpus-per-task```

in the SBATCH command section.

## License

This code is made available under the [BSD 3-Clause license](./LICENSE). However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, data, etc.
Binary file added assets/images/TorchTitan_logo_main.jpg
Binary file added assets/images/llama2_loss_curves.png
Binary file added assets/images/llama3_loss_curves.png
1 change: 1 addition & 0 deletions assets/images/readme.md
@@ -0,0 +1 @@
images folder for main repo
Binary file added assets/images/titan_play_video.png
36 changes: 36 additions & 0 deletions create_seed_checkpoint.sh
@@ -0,0 +1,36 @@
#!/usr/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

#
# create_seed_checkpoint.sh
#
# Run this script to create a seed checkpoint used to initialize a model from step-0.
# Seed checkpoints are used to initialize pipeline-parallel models since the model initializer
# functions don't cleanly run on chunked model parts after meta-initialization.
#
# Use the same model config to generate your seed checkpoint as you use for training.
# e.g.
# CONFIG_FILE=<path to model_config> ./create_seed_checkpoint.sh

set -ex

export USE_LIBUV=1
TRAINER_DIR=${1:-/home/$USER/local/torchtitan}
NGPU=1
LOG_RANK=0
CONFIG_FILE=${CONFIG_FILE:-"./train_configs/debug_model.toml"}

seed_checkpoint="--checkpoint.enable_checkpoint --checkpoint.create_seed_checkpoint"
force_1d="--training.data_parallel_degree 1 --training.tensor_parallel_degree 1 --training.pipeline_parallel_degree 1"
overrides=""
if [ $# -ne 0 ]; then
overrides="$*"
fi

torchrun --nproc_per_node=${NGPU} --rdzv_backend c10d --rdzv_endpoint="localhost:0" \
--local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
train.py --job.config_file ${CONFIG_FILE} $seed_checkpoint $force_1d $overrides