add TensorBoard logging with loss and wps #57
Conversation
train.py (outdated diff):

```python
time_delta = timer() - time_last_log
wps = nwords_since_last_log / (
    time_delta * parallel_dims.sp * parallel_dims.pp
)
```
A neater way is to define a `model_parallel_size` in the `ParallelDims` class that returns this number directly (i.e. a cached property).
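The reviewer's suggestion could be sketched as follows. The class name `ParallelDims` and the field names `dp`/`sp`/`pp` are taken from the diff context; `model_parallel_size` and the exact field set are assumptions, not the actual torchtrain definition.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class ParallelDims:
    # hypothetical parallelism degrees, inferred from the diff context
    dp: int  # data-parallel degree
    sp: int  # sequence-parallel degree
    pp: int  # pipeline-parallel degree

    @cached_property
    def model_parallel_size(self) -> int:
        # number of ranks that process the same words; wps is
        # divided by this so it is not over-counted
        return self.sp * self.pp


dims = ParallelDims(dp=8, sp=2, pp=2)
print(dims.model_parallel_size)  # → 4
```

With this, the wps computation becomes `nwords_since_last_log / (time_delta * parallel_dims.model_parallel_size)`, and the product is computed once and cached.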
This looks great! One minor comment, and please update the README to include how to set up and use TensorBoard.
Each rank builds its own TensorBoard writer. The global loss is communicated among all ranks before logging.

To visualize using SSH tunneling:

```shell
ssh -L 6006:127.0.0.1:6006 your_user_name@my_server_ip
```

then, in the torchtrain repo:

```shell
tensorboard --logdir=./torchtrain/outputs/tb
```

then open http://localhost:6006/ in a web browser.
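The logging flow described above can be sketched as follows. This is a minimal, self-contained illustration, not the torchtrain implementation: the `MetricLogger` name is an assumption, the TensorBoard `SummaryWriter` is stubbed with a dict so the sketch runs anywhere, and `global_avg_loss` stands in for a `dist.all_reduce` of the loss across ranks.

```python
class MetricLogger:
    """Stand-in for a per-rank TensorBoard SummaryWriter; records
    scalars in a dict instead of writing event files."""

    def __init__(self):
        self.scalars = {}

    def log(self, metrics, step):
        for name, value in metrics.items():
            self.scalars.setdefault(name, []).append((step, value))


def global_avg_loss(local_losses):
    # stand-in for dist.all_reduce(loss, op=ReduceOp.AVG): every rank
    # obtains the mean loss across all ranks before logging
    return sum(local_losses) / len(local_losses)


logger = MetricLogger()

# hypothetical values for one logging interval
loss = global_avg_loss([2.0, 2.2, 1.8, 2.0])
nwords_since_last_log, time_delta, model_parallel_size = 16384, 0.5, 2
wps = nwords_since_last_log / (time_delta * model_parallel_size)

logger.log({"loss": loss, "wps": wps}, step=10)
print(logger.scalars["loss"])  # → [(10, 2.0)]
```

In the real setup each rank would construct a `torch.utils.tensorboard.SummaryWriter` pointed at `./torchtrain/outputs/tb`, which is what the `tensorboard --logdir` command above reads.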