Commit f13fe3f

[EZ] Add logs for some basic training params so that we can verify in… (#491)
As titled: while testing on the 405B model, I found that we need logs for some basic training params, so I added some here. Tested locally; the resulting log output is shown in the screenshot: https://github.com/user-attachments/assets/b94e34f5-3e88-4c5f-94ed-75f50dde9786
1 parent 668f6cd commit f13fe3f

1 file changed, +6 -1 lines changed

train.py

Lines changed: 6 additions & 1 deletion
@@ -355,7 +355,12 @@ def loss_fn(pred, labels):
     gpu_memory_monitor.reset_peak_stats()
 
     # train loop
-    logger.info(f"Training starts at step {train_state.step + 1}")
+    logger.info(
+        f"Training starts at step {train_state.step + 1}, "
+        f"with local batch size: {job_config.training.batch_size}, "
+        f"sequence length: {job_config.training.seq_len}, "
+        f"total steps: {job_config.training.steps}({job_config.training.warmup_steps}), "
+    )
     with maybe_enable_profiling(
         job_config, global_step=train_state.step
     ) as torch_profiler, maybe_enable_memory_snapshot(
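To make the new log line concrete, here is a minimal sketch that reproduces the f-string from the diff outside of `train.py`. `TrainingConfig`, `JobConfig`, and `format_train_start` are hypothetical stand-ins for `job_config.training` and the inline `logger.info` call; the field values are made up for illustration and are not taken from the 405B run.

```python
import logging
from dataclasses import dataclass, field


@dataclass
class TrainingConfig:
    # Hypothetical stand-in for job_config.training; values are illustrative only.
    batch_size: int = 8
    seq_len: int = 2048
    steps: int = 1000
    warmup_steps: int = 200


@dataclass
class JobConfig:
    # Hypothetical stand-in for the real job config object.
    training: TrainingConfig = field(default_factory=TrainingConfig)


def format_train_start(step: int, job_config: JobConfig) -> str:
    """Build the same message as the logger.info call in the diff."""
    t = job_config.training
    return (
        f"Training starts at step {step + 1}, "
        f"with local batch size: {t.batch_size}, "
        f"sequence length: {t.seq_len}, "
        f"total steps: {t.steps}({t.warmup_steps}), "
    )


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

msg = format_train_start(0, JobConfig())
# msg == "Training starts at step 1, with local batch size: 8, sequence length: 2048, total steps: 1000(200), "
logger.info(msg)
```

As written in the diff, the warmup step count appears only in parentheses after the total step count, and the message ends with a trailing `, `.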
