Commit 3fce6bb

move config folder to root and adjust options (#83)
As titled: move the config files to the root folder, which decouples them from the torchtrain package build and allows easier navigation.
1 parent 468ce8f commit 3fce6bb

File tree: 4 files changed, +9 -18 lines

run_llama_train.sh

Lines changed: 2 additions & 11 deletions
@@ -6,24 +6,15 @@ TRAINER_DIR=${1:-/home/$USER/local/torchtrain}
 
 # use envs as local overrides for convenience
 # e.g.
-# LOG_RANK=0,1 NGPU=4 SP=2 ./run_llama_train.sh
+# LOG_RANK=0,1 NGPU=4 ./run_llama_train.sh
 
-MODEL=${MODEL:-"llama"}
-MODEL_CONF=${MODEL_CONF:-"debugmodel"}
 NGPU=${NGPU:-"8"}
-PP=${PP:-"1"}
-SP=${SP:-"1"}
-DP=${DP:-"-1"}
 
 # by default log just rank 0 output,
 LOG_RANK=${LOG_RANK:-0}
 
-# Change this string to a meaningful one to enable checkpoint
-CHECKPOINT_FOLDER=${CHECKPOINT_FOLDER:-""}
-# Please adjust this to a longer interval period. The unit of measurement is in steps.
-CHECKPOINT_INTERVAL=${CHECKPOINT_INTERVAL:-5}
 
-CONFIG_FILE=${CONFIG_FILE:-"./torchtrain/train_configs/train_config.toml"}
+CONFIG_FILE=${CONFIG_FILE:-"./train_configs/debug_model.toml"}
 
 torchrun --nproc_per_node=${NGPU} --rdzv_endpoint="localhost:5972" \
 --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
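
With the MODEL/MODEL_CONF/PP/SP/DP knobs gone, the script's remaining tunables are the env overrides documented in its own comment (LOG_RANK, NGPU) plus CONFIG_FILE. A minimal sketch of driving it programmatically, assuming the repo root as working directory:

import os
import subprocess

# Override the env vars run_llama_train.sh reads, then invoke it.
env = dict(os.environ,
           NGPU="4",
           LOG_RANK="0,1",
           CONFIG_FILE="./train_configs/debug_model.toml")
subprocess.run(["./run_llama_train.sh"], env=env, check=True)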

test/test_job_config.py

Lines changed: 4 additions & 3 deletions
@@ -1,3 +1,6 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
 import pytest
 from torchtrain.config_manager import JobConfig
 
@@ -10,9 +13,7 @@ def test_command_line_args(self):
 
     def test_job_config_file(self):
        config = JobConfig()
-        config.parse_args(
-            ["--job.config_file", "./torchtrain/train_configs/train_config.toml"]
-        )
+        config.parse_args(["--job.config_file", "./train_configs/debug_model.toml"])
         assert config.model.name == "llama"
 
     def test_job_file_does_not_exist(self):
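
The updated test doubles as a usage example for the relocated config: it is loaded through JobConfig's --job.config_file flag. A minimal standalone sketch, using only the calls that appear in this diff:

from torchtrain.config_manager import JobConfig

# Parse the relocated TOML the same way the updated test does.
config = JobConfig()
config.parse_args(["--job.config_file", "./train_configs/debug_model.toml"])
print(config.model.name)  # the test asserts this is "llama"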

torchtrain/train_configs/__init_.py

Lines changed: 0 additions & 1 deletion
This file was deleted.

torchtrain/train_configs/train_config.toml renamed to train_configs/debug_model.toml

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 # TorchTrain Config.toml
 [job]
-dump_folder = "./torchtrain/outputs"
+dump_folder = "./outputs"
 
 [profiling]
 run_profiler = true
@@ -26,8 +26,8 @@ lr = 8e-4
 [training]
 batch_size = 8
 seq_len = 2048
-warmup_pct = 0.20
-max_norm = 1.0
+warmup_pct = 0.20  # lr scheduler warm up
+max_norm = 1.0  # grad norm clipping
 steps = 10
 data_parallel_degree = -1
 sequence_parallel_degree = 1
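
Because the file now lives outside the torchtrain package, it can also be inspected with a stock TOML parser. A minimal sketch, assuming Python 3.11+ (for the stdlib tomllib) and the repo root as working directory:

import tomllib  # stdlib TOML parser in Python 3.11+

with open("./train_configs/debug_model.toml", "rb") as f:
    cfg = tomllib.load(f)

print(cfg["job"]["dump_folder"])      # "./outputs" after this change
print(cfg["training"]["warmup_pct"])  # 0.20, lr scheduler warm up
print(cfg["training"]["max_norm"])    # 1.0, grad norm clipping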
