[BE][doc] add memory_profiler to README #606
README.md
Outdated
@@ -42,6 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
10. DDP and HSDP
11. All options easily configured via [toml files](train_configs/)
12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
13. [Memory profier](docs/memory_profiler.md) dump memory snapshots |
Maybe modify line 38 instead? ("6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via TensorBoard")
This is a debugging feature rather than a major PTD feature.
We should add an item in the Debugging section together with Flight Recorder. Personally I don't mind creating a new entry in "key features" about debugging tools, but they should be both included or both excluded here.
Combine them together in line 45 as bullet point 13.
docs/memory_profiler.md
Outdated
* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
+ If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
+ If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`. |
Can you explain how people should open the pickle file? For example: go to the browser, drag the file in, and inspect it. We should cite https://pytorch.org/blog/understanding-gpu-memory-1/
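As a complement to the browser workflow requested above (the linked blog post describes dragging the pickle file onto the viewer at https://pytorch.org/memory_viz), here is a minimal Python sketch for taking a quick look at a dumped snapshot from a script. It assumes only that the file is an ordinary pickle, typically holding a dictionary. The helper name `summarize_snapshot` is hypothetical and not part of torchtitan, and no particular snapshot schema is assumed.

```python
import pickle
from pathlib import Path


def summarize_snapshot(path):
    """Load a pickled memory snapshot and report its top-level layout.

    Hypothetical helper: it only assumes the file is a regular pickle and
    does not rely on any particular snapshot schema.
    """
    with Path(path).open("rb") as f:
        # Only unpickle files you produced yourself; pickle can execute code.
        snapshot = pickle.load(f)

    summary = {"type": type(snapshot).__name__}
    if isinstance(snapshot, dict):
        # Record each top-level key and how many entries it holds, if sized.
        summary["keys"] = {
            key: len(value) if hasattr(value, "__len__") else None
            for key, value in snapshot.items()
        }
    return summary
```

Calling it on one of the dumped pickle files returns a small dictionary describing the top-level structure, which is enough to confirm the dump is readable before opening it in the viewer.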
docs/memory_profiler.md
Outdated
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
```
* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`. |
The default here should be `./outputs/memory_snapshot`.
I mean the default value for this field is `memory_snapshot`, but you should mention it would be under the output folder, which is controlled by `job.dump_folder`.
docs/memory_profiler.md
Outdated
```
* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
+ If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`. |
The first "." should be replaced with ",".
docs/memory_profiler.md
Outdated
* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
+ If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
+ If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`. |
ditto.
Had some suggestions on writing
looks good to me, thanks!
Add memory_profiler to README, explain how to use memory profiler with `--profiling.enable_memory_snapshot` and `--profiling.save_memory_snapshot_folder`
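
The folder layout discussed in this review (snapshot folder placed under the dump folder, with `iteration_x` for periodic dumps and `iteration_x_exit` for OOM dumps) can be sketched in Python as follows. This is an illustration of the naming scheme only, not torchtitan's actual code: the function `snapshot_dir` and its parameter names are hypothetical, and the `./outputs` default is taken from the review comments about `job.dump_folder`.

```python
import posixpath


def snapshot_dir(step, dump_folder="./outputs",
                 snapshot_folder="memory_snapshot", on_oom=False):
    """Return the folder a snapshot for `step` would land in.

    Illustrative only: mirrors the layout described in this review,
    <dump_folder>/<snapshot_folder>/iteration_<step>[_exit].
    """
    # OOM-triggered dumps get an "_exit" suffix; periodic dumps do not.
    suffix = "_exit" if on_oom else ""
    # POSIX-style separators, matching the paths quoted in the docs.
    return posixpath.join(dump_folder, snapshot_folder,
                          f"iteration_{step}{suffix}")
```

For example, `snapshot_dir(10)` gives `./outputs/memory_snapshot/iteration_10`, and `snapshot_dir(10, on_oom=True)` gives `./outputs/memory_snapshot/iteration_10_exit`.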