Commit 60b1173

Make freqs_cis a persistent buffer for pp init
Currently, the plan is to use a 'seed checkpoint' to initialize the pipeline-parallel model chunks after moving them from the meta device to cuda/empty. Non-persistent buffers are incompatible with this approach: they are missing from the checkpoint and thus require manual initialization. An alternative is to manually run the initializer for just the non-persistent buffers after loading a seed checkpoint, but making the buffer persistent is nearly equivalent and requires fewer code changes.

ghstack-source-id: b482284
Pull Request resolved: #201
1 parent d758406 commit 60b1173

File tree

1 file changed: +9 −3 lines

torchtrain/models/llama/model.py

Lines changed: 9 additions & 3 deletions
@@ -309,9 +309,15 @@ def __init__(self, model_args: ModelArgs):
         super().__init__()
         self.model_args = model_args
         self.tok_embeddings = nn.Embedding(model_args.vocab_size, model_args.dim)
-        self.register_buffer(
-            "freqs_cis", self._precompute_freqs_cis(), persistent=False
-        )
+
+        # TODO persistent should be set to false, since this buffer can be recomputed.
+        # however, we set it to true for 2 reasons.  (1) due to pytorch/pytorch#123411,
+        # compile or pipeline-tracer will not correctly handle non-persistent buffers,
+        # so we need to fix that.  (2) if we initialize pipeline-parallel models from
+        # a seed checkpoint rather than calling init_weights, we need freqs_cis to be
+        # initialized by the checkpoint, or we need to add a separate initializer for
+        # just the non-persistent buffers that is called after loading checkpoints.
+        self.register_buffer("freqs_cis", self._precompute_freqs_cis(), persistent=True)
 
     def _precompute_freqs_cis(self):
         return precompute_freqs_cis(
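The core issue the commit works around can be demonstrated in isolation: a buffer registered with `persistent=False` is excluded from `state_dict()`, so loading a seed checkpoint cannot initialize it. A minimal sketch (the `Rotary` module and its dummy buffer contents are hypothetical stand-ins for the real `freqs_cis` setup):

```python
import torch
import torch.nn as nn


class Rotary(nn.Module):
    """Hypothetical minimal module illustrating the persistence tradeoff."""

    def __init__(self, persistent: bool):
        super().__init__()
        # Stand-in for freqs_cis; the real model computes this via
        # precompute_freqs_cis() rather than torch.arange.
        self.register_buffer("freqs_cis", torch.arange(4.0), persistent=persistent)


# Non-persistent: the buffer is missing from the checkpoint, so a model
# materialized from meta device would need a separate manual initializer.
print("freqs_cis" in Rotary(persistent=False).state_dict())  # False

# Persistent: the buffer round-trips through save/load, so the seed
# checkpoint alone fully initializes the module.
print("freqs_cis" in Rotary(persistent=True).state_dict())  # True
```

This is why the commit flips `persistent` to `True` rather than adding a post-load initialization pass for the non-persistent buffers.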

0 commit comments