
Conversation


@mori360 mori360 commented Oct 8, 2024

Add `memory_profiler` to the README, explaining how to use the memory profiler with `--profiling.enable_memory_snapshot` and `--profiling.save_memory_snapshot_folder`.

@mori360 mori360 requested review from weifengpy and tianyu-l October 8, 2024 17:54
@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 8, 2024
README.md Outdated
@@ -42,6 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
10. DDP and HSDP
11. All options easily configured via [toml files](train_configs/)
12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
13. [Memory profier](docs/memory_profiler.md) dump memory snapshots
Contributor

Maybe modify line 38 ("6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via TensorBoard") instead?

This is a debugging feature rather than a major PTD feature.

Contributor

We should add an item in the Debugging section together with Flight Recorder. Personally I don't mind creating a new entry in "key features" about debugging tools, but they should both be included or both excluded here.

Contributor Author

Combined them together at line 45 as bullet point 13.

* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
+ If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
+ If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
Contributor

Explain how people should open the pickle file, e.g. go to a browser, drag and drop it, and check. We should quote https://pytorch.org/blog/understanding-gpu-memory-1/
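
For context, the linked blog post records and dumps snapshots roughly as in the sketch below. This is a minimal standalone example, not torchtitan's implementation; the file name and `max_entries` value are illustrative.

```python
# Minimal sketch following the "Understanding GPU Memory" blog post linked above;
# not torchtitan's code. Requires a CUDA device; file name and sizes are illustrative.
import torch

# Start recording allocator events (keeps up to `max_entries` recent events).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... a few iterations of work whose memory usage we want to inspect ...
x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump the recorded history to a pickle file.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)
```

The resulting `.pickle` file can then be dragged into the interactive viewer at https://pytorch.org/memory_viz to inspect allocations over time, as described in the blog post.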

@mori360 mori360 marked this pull request as draft October 8, 2024 21:15
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
```
* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
Contributor

The default here should be `./outputs/memory_snapshot`.

Contributor

I mean the default value for this field is `memory_snapshot`, but you should mention it would be under the output folder, which is controlled by `job.dump_folder`.
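
To illustrate the comment above: the snapshot folder ends up nested under the job's dump folder. A hypothetical sketch of how the final path is composed (the helper and defaults here are illustrative, not torchtitan's actual code):

```python
# Hypothetical illustration of how the snapshot path is composed from the two
# config fields discussed above; not torchtitan's actual implementation.
import os

def snapshot_dir(dump_folder: str = "./outputs",
                 save_memory_snapshot_folder: str = "memory_snapshot") -> str:
    # job.dump_folder is the top-level output folder;
    # profiling.save_memory_snapshot_folder (default "memory_snapshot") sits underneath it.
    return os.path.join(dump_folder, save_memory_snapshot_folder)

print(snapshot_dir())  # ./outputs/memory_snapshot
```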

@mori360 mori360 marked this pull request as ready for review October 8, 2024 21:28
@mori360 mori360 marked this pull request as draft October 8, 2024 21:30
@mori360 mori360 marked this pull request as ready for review October 8, 2024 21:31
```
* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
+ If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
Contributor

The first "." should be replaced with ",".

* `--profiling.enable_memory_snapshot`: enable memory snapshot
* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
+ If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
+ If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
Contributor

ditto.
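
Putting the two bullets above together: snapshots land in per-iteration subfolders, named `iteration_x` for periodic dumps (every `profile_freq` steps) and `iteration_x_exit` when the run exits on an OOM. Below is a rough sketch of that policy, not torchtitan's actual logic; it assumes a CUDA device and that memory history recording was already enabled.

```python
# Rough sketch of the dump policy described above (periodic vs. OOM-exit dumps);
# folder names follow the doc, but this is not torchtitan's actual code.
# Assumes a CUDA device and that torch.cuda.memory._record_memory_history() is enabled.
import os
import torch

def dump_memory_snapshot(step: int, base_dir: str, exiting_on_oom: bool = False) -> None:
    sub = f"iteration_{step}_exit" if exiting_on_oom else f"iteration_{step}"
    out_dir = os.path.join(base_dir, sub)
    os.makedirs(out_dir, exist_ok=True)
    torch.cuda.memory._dump_snapshot(os.path.join(out_dir, "memory_snapshot.pickle"))

def train_step() -> None:
    # placeholder for one real training iteration
    pass

profile_freq = 10
base = "./outputs/memory_snapshot"
for step in range(1, 31):
    try:
        train_step()
    except torch.cuda.OutOfMemoryError:
        # dump a final snapshot before propagating the OOM
        dump_memory_snapshot(step, base, exiting_on_oom=True)
        raise
    if step % profile_freq == 0:
        dump_memory_snapshot(step, base)
```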

@tianyu-l tianyu-l left a comment

Had some suggestions on writing

@tianyu-l tianyu-l left a comment

looks good to me, thanks!

@mori360 mori360 merged commit 7408074 into pytorch:main Oct 10, 2024
5 checks passed