Fix the incorrect step log for profiler after resuming from a checkpoint #293


Merged
fegin merged 2 commits into main on May 3, 2024

Conversation

@fegin fegin (Contributor) commented May 2, 2024

Summary:
The profiler currently maintains its own local step counter, which is not synchronized with the checkpointed train step, so the step it logs is incorrect after resuming from a checkpoint. This PR fixes the issue.

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on May 2, 2024
@fegin fegin requested a review from tianyu-l May 2, 2024 06:37
```diff
@@ -27,8 +27,7 @@ def maybe_enable_profiling(config: JobConfig, *pos_args, **kwargs):
     trace_dir = os.path.join(dump_dir, save_trace_dir)
     profile_freq = config.profiling.profile_freq

-    _global_iter_count = 0
-
+    _global_iter_count = global_step
```
A reviewer (Contributor) commented on this diff:
This does not sync with the profiler's internal step_num state. Let's remove it and replace all appearances of _global_iter_count with prof.step_num in trace_handler.

Also, torch_profiler.step_num = global_step needs to be set on line 71.
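
For readers outside the review, here is a minimal sketch of the approach the comment describes; the function signature, schedule values, and trace-directory naming below are illustrative assumptions, not the actual torchtitan code. The idea is that the profiler's internal step_num is seeded from the checkpointed global step, and the trace handler reads prof.step_num instead of a separate local counter.

```python
# Minimal sketch, not the torchtitan implementation: signature, schedule values,
# and trace-directory naming are assumptions for illustration only.
import contextlib
import os

import torch


@contextlib.contextmanager
def maybe_enable_profiling(trace_dir: str, profile_freq: int, global_step: int = 0):
    def trace_handler(prof):
        # prof.step_num is the profiler's internal counter; once seeded below it
        # matches the training loop's global step, so trace folders get the right name.
        curr_trace_dir = os.path.join(trace_dir, f"iteration_{prof.step_num}")
        os.makedirs(curr_trace_dir, exist_ok=True)
        prof.export_chrome_trace(os.path.join(curr_trace_dir, "trace.json"))

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(wait=profile_freq - 2, warmup=1, active=1),
        on_trace_ready=trace_handler,
    ) as torch_profiler:
        # Sync the internal counter with the checkpointed train step, per the comment above.
        torch_profiler.step_num = global_step
        yield torch_profiler
```

The training loop would then call torch_profiler.step() once per iteration, and the exported trace directory reflects the true global step even after a resume.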

@fegin fegin requested a review from tianyu-l May 2, 2024 22:17
@tianyu-l tianyu-l (Contributor) left a comment:

LGTM

Note: if resuming from a global_step such that global_step % profile_freq == profile_freq - 1, there won't be a profile trace at step global_step + 1, which would have been produced if not resuming from a checkpoint.

Essentially this is caused by the profiler not maintaining a state dict. The miss could be avoided by implementing save & load functions for it, but that doesn't seem worth the effort.
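
As a rough illustration of what "implementing save & load functions" for the profiler could mean, here is a hedged sketch; the wrapper class and key name are hypothetical and nothing like this exists in the PR:

```python
# Hypothetical sketch only; this wrapper is not part of the PR or of torchtitan.
class ProfilerCheckpointWrapper:
    def __init__(self, torch_profiler):
        self.torch_profiler = torch_profiler

    def state_dict(self) -> dict:
        # Persist the internal counter so a resumed run stays in the same schedule phase.
        return {"step_num": self.torch_profiler.step_num}

    def load_state_dict(self, state_dict: dict) -> None:
        self.torch_profiler.step_num = state_dict["step_num"]
```

Checkpointing step_num this way would preserve the wait/warmup/active phase across a resume and close the edge case above, at the cost of wiring the profiler into the checkpoint manager.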

@fegin fegin (Contributor, Author) commented May 3, 2024

In some trainers, profiling is designed to run only once, and the global step is used to prevent profiling from happening again after resuming from a checkpoint.
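
A tiny illustration of that pattern; the helper function and the chosen step are hypothetical, not taken from any particular trainer:

```python
# Hypothetical illustration of the "profile only once" pattern mentioned above.
def should_profile(global_step: int, profile_step: int = 10) -> bool:
    # Because the comparison uses the checkpointed global step, a run resumed
    # past profile_step never profiles again.
    return global_step == profile_step
```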

@fegin fegin merged commit 695bd01 into main on May 3, 2024
tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
Fix the incorrect step log for profiler after resuming from a checkpoint (pytorch#293)

philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
Fix the incorrect step log for profiler after resuming from a checkpoint (pytorch#293)
