
Commit b4ab627

gnadathur and gnadathur authored
Add integration test with compile enabled (#183)
Summary: same as title

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model_compile.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757]
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training
[rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-04-01 17:54:37,643 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank1]:2024-04-01 17:54:38,324 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled
[rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank1]:  warnings.warn(
[rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,734 - root - INFO - step: 1  loss: 10.9746  memory: 9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
[rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank1]:2024-04-01 17:54:52,813 - root - INFO - step: 2  loss: 10.9091  memory: 9.54GiB(10.03%)  wps: 208,739  mfu: 2.56%
[rank0]:2024-04-01 17:54:52,734 - root - INFO - step: 1  loss: 10.9746  memory: 9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
[rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-04-01 17:54:52,813 - root - INFO - step: 2  loss: 10.9091  memory: 9.54GiB(10.03%)  wps: 208,501  mfu: 2.55%
[rank1]:2024-04-01 17:54:52,889 - root - INFO - step: 3  loss: 10.7722  memory: 9.54GiB(10.03%)  wps: 219,416  mfu: 2.69%
[rank0]:2024-04-01 17:54:52,889 - root - INFO - step: 3  loss: 10.7722  memory: 9.54GiB(10.03%)  wps: 219,182  mfu: 2.68%
[rank1]:2024-04-01 17:54:52,965 - root - INFO - step: 4  loss: 10.5428  memory: 9.54GiB(10.03%)  wps: 218,226  mfu: 2.67%
[rank0]:2024-04-01 17:54:52,965 - root - INFO - step: 4  loss: 10.5428  memory: 9.54GiB(10.03%)  wps: 218,015  mfu: 2.67%
[rank1]:2024-04-01 17:54:53,045 - root - INFO - step: 5  loss: 10.3063  memory: 9.54GiB(10.03%)  wps: 207,094  mfu: 2.54%
[rank0]:2024-04-01 17:54:53,045 - root - INFO - step: 5  loss: 10.3063  memory: 9.54GiB(10.03%)  wps: 207,220  mfu: 2.54%
[rank1]:2024-04-01 17:54:53,123 - root - INFO - step: 6  loss: 10.0707  memory: 9.54GiB(10.03%)  wps: 210,814  mfu: 2.58%
[rank1]:2024-04-01 17:54:53,202 - root - INFO - step: 7  loss: 9.8302  memory: 9.54GiB(10.03%)  wps: 209,649  mfu: 2.57%
[rank0]:2024-04-01 17:54:53,123 - root - INFO - step: 6  loss: 10.0707  memory: 9.54GiB(10.03%)  wps: 210,849  mfu: 2.58%
[rank0]:2024-04-01 17:54:53,202 - root - INFO - step: 7  loss: 9.8302  memory: 9.54GiB(10.03%)  wps: 209,542  mfu: 2.57%
[rank0]:2024-04-01 17:54:53,281 - root - INFO - step: 8  loss: 9.5918  memory: 9.54GiB(10.03%)  wps: 211,690  mfu: 2.59%
[rank1]:2024-04-01 17:54:53,281 - root - INFO - step: 8  loss: 9.5918  memory: 9.54GiB(10.03%)  wps: 211,786  mfu: 2.59%
[rank1]:2024-04-01 17:54:53,412 - root - INFO - step: 9  loss: 9.4299  memory: 9.54GiB(10.03%)  wps: 125,833  mfu: 1.54%
[rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-01 17:54:53,412 - root - INFO - step: 9  loss: 9.4299  memory: 9.54GiB(10.03%)  wps: 125,765  mfu: 1.54%
[rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss: 9.2955  memory: 9.54GiB(10.03%)  wps: 207,661  mfu: 2.54%
[rank0]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss: 9.2955  memory: 9.54GiB(10.03%)  wps: 207,426  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <[email protected]>
1 parent dca7657 commit b4ab627

File tree

3 files changed: +53 additions, -60 deletions

run_llama_train.sh

Lines changed: 6 additions & 1 deletion

```diff
@@ -19,6 +19,11 @@ LOG_RANK=${LOG_RANK:-0}
 
 CONFIG_FILE=${CONFIG_FILE:-"./train_configs/debug_model.toml"}
 
+overrides=""
+if [ $# -ne 0 ]; then
+    overrides="$*"
+fi
+
 torchrun --nproc_per_node=${NGPU} --rdzv_endpoint="localhost:5972" \
 --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
-train.py --job.config_file ${CONFIG_FILE}
+train.py --job.config_file ${CONFIG_FILE} $overrides
```
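
With this change, any extra arguments passed to the script are collected into `$overrides` and appended to the `train.py` invocation after `--job.config_file`. A minimal usage sketch (the config path and the `--training.compile` flag below are the ones exercised by this PR's test runner; adjust for your setup):

```sh
# Forward a CLI override through run_llama_train.sh; the script appends
# "$overrides" to the train.py command built by torchrun.
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile
```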

test/test_runner.py

Lines changed: 47 additions & 10 deletions

```diff
@@ -5,27 +5,64 @@
 # All rights reserved.
 import os
 import subprocess
+from collections import defaultdict
+from dataclasses import dataclass
+from typing import Sequence
 
 try:
     import tomllib
 except ModuleNotFoundError:
     import tomli as tomllib
 
+
+@dataclass
+class OverrideDefinitions:
+    """
+    This class is used to define the override definitions for the integration tests.
+    """
+
+    override_args: Sequence[str] = tuple()
+    test_descr: str = "default"
+
+
 CONFIG_DIR = "./train_configs"
+
+"""
+key is the config file name and value is a list of OverrideDefinitions
+that is used to generate variations of integration tests based on the
+same root config file.
+"""
+integration_tests_flavors = defaultdict(list)
+integration_tests_flavors["debug_model.toml"] = [
+    OverrideDefinitions(["--training.compile"], "1D compile"),
+    OverrideDefinitions(
+        ["--training.tensor_parallel_degree 2"], "Eager mode 2DParallel"
+    ),
+]
+
+
 for config_file in os.listdir(CONFIG_DIR):
     if config_file.endswith(".toml"):
         full_path = os.path.join(CONFIG_DIR, config_file)
         with open(full_path, "rb") as f:
             config = tomllib.load(f)
             is_integration_test = config["job"].get("use_for_integration_test", False)
             if is_integration_test:
-                cmd = f"CONFIG_FILE={full_path} NGPU=4 ./run_llama_train.sh"
-                print(f"=====Integration test: {cmd}=====")
-                result = subprocess.run(
-                    [cmd],
-                    stdout=subprocess.PIPE,
-                    stderr=subprocess.STDOUT,
-                    text=True,
-                    shell=True,
-                )
-                print(result.stdout)
+                test_flavors = [OverrideDefinitions()] + integration_tests_flavors[
+                    config_file
+                ]
+                for test_flavor in test_flavors:
+                    cmd = f"CONFIG_FILE={full_path} NGPU=4 ./run_llama_train.sh"
+                    if test_flavor.override_args:
+                        cmd += " " + " ".join(test_flavor.override_args)
+                    print(
+                        f"=====Integration test, flavor : {test_flavor.test_descr}, command : {cmd}====="
+                    )
+                    result = subprocess.run(
+                        [cmd],
+                        stdout=subprocess.PIPE,
+                        stderr=subprocess.STDOUT,
+                        text=True,
+                        shell=True,
+                    )
+                    print(result.stdout)
```
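
Given the flavors registered above for `debug_model.toml`, the runner builds one command per flavor (the default flavor first, then each override set). A sketch of the commands it is expected to execute, assuming `use_for_integration_test = true` is set in that config:

```sh
# Default flavor (no overrides), then the "1D compile" and
# "Eager mode 2DParallel" variants registered in test_runner.py.
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.tensor_parallel_degree 2
```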

train_configs/debug_model_2d.toml

Lines changed: 0 additions & 49 deletions
This file was deleted.
