deepspeedai · sfc-gh-truwase · Aug 16, 2025 · Jul 3, 2025 · Jul 3, 2025 · Aug 4, 2025
@@ -0,0 +1,74 @@
+# ZenFlow Benchmark Example
+
+
+Please install DeepSpeed via pip install deepspeed if you haven't already done so. 
+
+```bash
+pip install -r requirements.txt
+```
+
+
+The script `zf_benchmark.py ` demonstrates how to offload the state of a model. Here is the example usage.
+
+```python
+$ deepspeed --num_gpus=4 zf_benchmark.py --hidden_dim 4096 --nlayers 4 --iteration 5 --pin_memory_opts 1 --topk_ratios 0.1 --update_intervals 2 --overlap_steps
+...
+time (ms) | selective_optimizer_update: 19.20 | selective_optimizer_process: 28.80 | selective_optimizer_sync: 0.05
+time (ms) | fwd_microstep: 54.76 | bwd_microstep: 122.95 | bwd_inner_microstep: 12.22 | bwd_allreduce_microstep: 103.64 | step_microstep: 0.34
+Step 0 time: 178.66ms
+time (ms) | optimizer_allgather: 26.19 | optimizer_gradients: 26.06 | optimizer_step: 128.20
+time (ms) | selective_optimizer_update: 0.00 | selective_optimizer_process: 0.57 | selective_optimizer_step: 1.48 | selective_optimizer_sync: 0.00
+time (ms) | fwd_microstep: 0.38 | bwd_microstep: 57.88 | bwd_inner_microstep: 1.06 | bwd_allreduce_microstep: 56.50 | step_microstep: 183.27
+time (ms) | fwd: 55.15 | bwd: 180.82 | bwd_inner: 13.28 | bwd_allreduce: 160.15 | step: 183.61
+Step 1 time: 242.16ms
+time (ms) | selective_optimizer_update: 0.00 | selective_optimizer_process: 1.58 | selective_optimizer_step: 0.00 | selective_optimizer_sync: 0.00
+time (ms) | fwd_microstep: 0.30 | bwd_microstep: 16.73 | bwd_inner_microstep: 1.39 | bwd_allreduce_microstep: 14.96 | step_microstep: 0.20
+Step 2 time: 17.60ms
+time (ms) | optimizer_allgather: 0.65 | optimizer_gradients: 16.95 | optimizer_step: 108.45
+time (ms) | selective_optimizer_update: 0.00 | selective_optimizer_process: 0.56 | selective_optimizer_step: 1.42 | selective_optimizer_sync: 0.00
+time (ms) | fwd_microstep: 0.29 | bwd_microstep: 36.65 | bwd_inner_microstep: 0.95 | bwd_allreduce_microstep: 35.51 | step_microstep: 128.57
+time (ms) | fwd: 0.59 | bwd: 53.39 | bwd_inner: 2.33 | bwd_allreduce: 50.48 | step: 128.77
+Step 3 time: 166.10ms
+time (ms) | selective_optimizer_update: 0.00 | selective_optimizer_process: 1.57 | selective_optimizer_step: 0.00 | selective_optimizer_sync: 0.00
+time (ms) | fwd_microstep: 0.31 | bwd_microstep: 15.47 | bwd_inner_microstep: 1.33 | bwd_allreduce_microstep: 13.97 | step_microstep: 0.23
+...
+[Summary] pin_memory=False topk_ratio=0.1 update_interval=2 overlap_step=False avg_accumulation_step=16.77ms avg_update_step=171.38ms
+```
+
+`run_benchmark.sh` shows how to run the script with different configurations. The script outputs the time for offloading and loading the states.
+
+```python
+$ ./run_benchmark.sh
+...
++---------+--------------+-------------------+----------------+--------------+-----------------+----------------+----------------+-------------------------------------+
+|   trial |   topk_ratio |   update_interval | overlap_step   | pin_memory   |   avg_step (ms) |   avg_bwd (ms) |   avg_fwd (ms) |   avg_selective_optimizer_step (ms) |
+|---------+--------------+-------------------+----------------+--------------+-----------------+----------------+----------------+-------------------------------------|
+|       1 |          0.1 |                 2 | False          | False        |         24.0153 |        12.8377 |        1.91733 |                            0.247    |
+|       1 |          0.1 |                 2 | False          | False        |         22.8293 |        12.5187 |        1.73767 |                            0.258333 |
+|       1 |          0.1 |                 2 | False          | True         |         21.6523 |        10.2863 |        1.97767 |                            0.250333 |
+|       1 |          0.1 |                 4 | False          | False        |         14.2108 |        10.9072 |        1.2436  |                            0.1484   |
+|       1 |          0.1 |                 4 | False          | False        |         13.6408 |        10.8386 |        1.2208  |                            0.1456   |
+|       1 |          0.1 |                 4 | False          | True         |         12.863  |         9.0592 |        1.2148  |                            0.1464   |...
+```
+
+
+**Notes:** Each row in the table represents the average performance metrics for a specific configuration of ZenFlow’s offloading setup, defined by:
+
+- **`topk_ratio`**: The fraction of parameters selected for offloading during each update.
+- **`update_interval`**: How often (in steps) the offloading state is updated.
+- **`overlap_step`**: Whether overlapping offloading with computation is enabled.
+- **`pin_memory`**: Whether pinned host memory is used to speed up data transfer between CPU and GPU.
+
+The performance metrics include:
+
+- **`avg_step (ms)`**: Total time per training step — the primary measure of end-to-end training performance.
+- **`avg_bwd (ms)`**: Time spent in the backward pass, including gradient computation and allreduce.
+- **`avg_fwd (ms)`**: Time spent in the forward pass.
+- **`avg_selective_optimizer_step (ms)`**: Time spent in the selective optimizer step — indicates overhead introduced by ZenFlow’s offloading logic.
+
+**Tips for Analysis:**
+
+- Lower **`avg_step`** means faster training.
+- Comparing configurations helps identify performance trade-offs (e.g., `pin_memory=True` often reduces transfer latency).
+- A higher **`update_interval`** typically reduces offloading frequency and overhead.
+- Enabling **`overlap_step=True`** can further hide offloading latency behind computation when the model update phase is longer.
@@ -0,0 +1,79 @@
+import re
+from collections import defaultdict
+import pandas as pd
+from tabulate import tabulate
+
+def parse_log_file(log_file_path):
+    with open(log_file_path, 'r') as f:
+        lines = f.readlines()
+
+    # Regex patterns
+    trial_header_re = re.compile(
+        r"\[Trial (\d+)] pin_memory=(\d), topk=([\d.]+), update=(\d+), overlap_step=(\d+) \(MASTER_PORT=\d+\)"
+    )
+    time_metrics_re = re.compile(r"\|\s*([^:|]+):\s*([\d.]+)")
+
+    trials = []
+    current_config = None
+    current_step_metrics = []
+
+    def finalize_trial():
+        if current_config and current_step_metrics:
+            # Get all unique keys
+            all_keys = set()
+            for step in current_step_metrics:
+                all_keys.update(step.keys())
+            # Aggregate and average
+            agg = {k: 0.0 for k in all_keys}
+            for step in current_step_metrics:
+                for k in all_keys:
+                    agg[k] += step.get(k, 0.0)
+            avg = {f"avg_{k}": agg[k] / len(current_step_metrics) for k in all_keys}
+            trials.append({**current_config, **avg, "num_steps": len(current_step_metrics)})
+
+    for line in lines:
+        header_match = trial_header_re.search(line)
+        if header_match:
+            finalize_trial()
+            trial_id, pin_memory, topk, update, overlap = header_match.groups()
+            current_config = {
+                "trial": int(trial_id),
+                "pin_memory": bool(int(pin_memory)),
+                "topk_ratio": float(topk),
+                "update_interval": int(update),
+                "overlap_step": bool(int(overlap))
+            }
+            current_step_metrics = []
+            continue
+
+        if "[Rank 0]" in line and "time (ms)" in line:
+            metrics = {k.strip(): float(v) for k, v in time_metrics_re.findall(line)}
+            current_step_metrics.append(metrics)
+
+    finalize_trial()
+    return pd.DataFrame(trials)
+
+if __name__ == "__main__":
+
+    log_file = "zf_benchmark.log"
+    df = parse_log_file(log_file)
+    df = df.sort_values(by=["topk_ratio", "overlap_step", "update_interval",  "pin_memory"])
+    cols_to_display = [
+        "trial", "topk_ratio", "update_interval", "overlap_step", "pin_memory",
+        "avg_step", "avg_bwd", "avg_fwd", "avg_selective_optimizer_step"
+    ]
+
+    headers_with_units = {
+        "trial": "trial",
+        "pin_memory": "pin_memory",
+        "update_interval": "update_interval",
+        "overlap_step": "overlap_step",
+        "topk_ratio": "topk_ratio",
+        "avg_step": "avg_step (ms)",
+        "avg_bwd": "avg_bwd (ms)",
+        "avg_fwd": "avg_fwd (ms)",
+        "avg_selective_optimizer_step": "avg_selective_optimizer_step (ms)"
+
+    }
+    headers = [headers_with_units[col] for col in cols_to_display]
+    print(tabulate(df[cols_to_display], headers=headers, tablefmt="psql", showindex=False))
@@ -0,0 +1,7 @@
+torch>=2.5.1
+deepspeed>=0.16.0
+datasets>=2.14.1
+transformers>=4.37.2
+numpy>=1.21.0
+tabulate
+pandas
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+NGPUS=2
+HIDDEN_SIZE=4096
+NUM_LAYERS=4
+TRIALS=1
+
+PIN_MEMORY_OPTS=(0 1)
+TOPK_RATIOS=(0.1 0.2)
+UPDATE_INTERVALS=(2 4)
+OVERLAP_STEPS=(1 0)
+
+for pin_memory in "${PIN_MEMORY_OPTS[@]}"; do
+  for topk in "${TOPK_RATIOS[@]}"; do
+    for update in "${UPDATE_INTERVALS[@]}"; do
+      for overlap in "${OVERLAP_STEPS[@]}"; do
+        for ((trial=0; trial<$TRIALS; trial++)); do
+          # Generate a random port between 20000 and 65000
+          MASTER_PORT=$((20000 + RANDOM % 45000))
+          echo "[Trial $((trial+1))] pin_memory=$pin_memory, topk=$topk, update=$update, overlap_step=$overlap (MASTER_PORT=$MASTER_PORT)" | tee -a zf_benchmark.log
+          deepspeed --master_port $MASTER_PORT \
+            --num_gpus=$NGPUS \
+            zf_benchmark.py \
+            --hidden_dim $HIDDEN_SIZE \
+            --nlayers $NUM_LAYERS \
+            --iteration 5 \
+            --pin_memory_opts $pin_memory \
+            --topk_ratios $topk \
+            --update_intervals $update \
+            --overlap_steps $overlap | tee -a zf_benchmark.log
+        done
+      done
+    done
+  done
+done
+python output_table.py
@@ -0,0 +1,150 @@
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+
+# DeepSpeed Team
+
+import argparse
+import torch
+import deepspeed.comm as dist
+import time
+
+import deepspeed
+
+class SimpleModel(torch.nn.Module):
+
+    def __init__(self, hidden_dim, empty_grad=False, nlayers=1):
+        super(SimpleModel, self).__init__()
+        self.linears = torch.nn.ModuleList([torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(nlayers)])
+        if empty_grad:
+            self.linear2 = torch.nn.Linear(hidden_dim, hidden_dim)
+        self.cross_entropy_loss = torch.nn.CrossEntropyLoss()
+
+    def forward(self, x, y):
+        for l in self.linears:
+            x = l(x)
+        return self.cross_entropy_loss(x, y)
+
+
+def random_dataset(total_samples, hidden_dim, device, dtype):
+    train_data = torch.randn(total_samples, hidden_dim, device=device, dtype=dtype)
+    train_label = torch.empty(total_samples, dtype=torch.long, device=device).random_(hidden_dim)
+    train_dataset = torch.utils.data.TensorDataset(train_data, train_label)
+    return train_dataset
+
+
+def random_dataloader(model, total_samples, hidden_dim, device, dtype):
+    batch_size = model.train_micro_batch_size_per_gpu()
+    train_dataset = random_dataset(total_samples, hidden_dim, device, dtype=dtype)
+    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
+    return train_loader
+
+
+def run_model(model, config_dict, hidden_dim, dtype, pin_memory, topk_ratio, update_interval, overlap_step, iteration):
+
+    model, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=config_dict)
+
+
+    data_loader = random_dataloader(model=model,
+                                    total_samples=iteration,
+                                    hidden_dim=hidden_dim,
+                                    device=model.device,
+                                    dtype=dtype)
+
+    time_step_list = []
+    accumulation_step_time_list = []
+    update_step_time_list = []
+
+    dist.barrier()
+    for i, batch in enumerate(data_loader):
+        step_start_time = time.time()
+        loss = model(batch[0], batch[1])
+        model.backward(loss)
+        model.step()
+        step_end_time = time.time()
+        step_time = step_end_time - step_start_time
+        if dist.get_rank() == 0:
+            print(f"Step {i} time: {step_time*1000:.2f}ms")
+        if i >= update_interval:
+            time_step_list.append(step_time)
+            if (i + 1) % update_interval == 0:
+                update_step_time_list.append(step_time)
+            else:
+                accumulation_step_time_list.append(step_time)
+
+    if dist.get_rank() == 0:
+        with open("zenflow_report.log", "a") as f:
+            msg = f"{1 if pin_memory else 0}," \
+                f"{topk_ratio}," \
+                f"{update_interval}," \
+                f"{overlap_step}," \
+                f"{sum(accumulation_step_time_list) / len(accumulation_step_time_list):.2f}," \
+                f"{sum(update_step_time_list) / len(update_step_time_list):.2f}"
+            f.write(f"{msg}\n")
+        print(f"[Summary] pin_memory={pin_memory} topk_ratio={topk_ratio} update_interval={update_interval} overlap_step={overlap_step} avg_accumulation_step={sum(accumulation_step_time_list) * 1000 / len(accumulation_step_time_list):.2f}ms avg_update_step={sum(update_step_time_list) * 1000 / len(update_step_time_list):.2f}ms")
+
+    model.destroy()
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--nlayers", type=int, default=1)
+    parser.add_argument("--hidden_dim", type=int, default=1024)
+    parser.add_argument("--dtype", choices=['torch.bfloat16', 'torch.float16', 'torch.float32'], default='torch.bfloat16')
+    parser.add_argument("--iteration", type=int, default=5)
+    parser.add_argument("--local_rank", type=int, default=-1)
+
+    parser.add_argument("--pin_memory_opts", type=int, required=True)
+    parser.add_argument("--topk_ratios", type=float, required=True)
+    parser.add_argument("--update_intervals", type=int, required=True)
+    parser.add_argument("--overlap_steps", type=int, required=True)
+
+    # Optional: explicitly receive master_port (though deepspeed handles it via env)
+    parser.add_argument("--master_port", type=int, default=None)
+
+    args = parser.parse_args()
+    dtype = eval(args.dtype)
+
+
+    pin_memory = bool(args.pin_memory_opts)
+    topk_ratio = args.topk_ratios
+    update_interval = args.update_intervals
+    overlap_step = bool(args.overlap_steps)
+    total_iteration = args.iteration * update_interval
+
+    config_dict = {
+        "train_micro_batch_size_per_gpu": 1,
+        "optimizer": {
+            "type": "Adam",
+            "params": {
+                "lr": 1e-6
+            }
+        },
+        "zero_optimization": {
+            "stage": 2,
+            "offload_optimizer": {
+                "device": "cpu",
+                "pin_memory": pin_memory
+            },
+            "zenflow": {
+                "topk_ratio": topk_ratio,
+                "update_interval": update_interval,
+                "full_warm_up_rounds": 0,
+                "overlap_step": overlap_step
+            },
+        },
+        "wall_clock_breakdown": True,
+        "zero_allow_untested_optimizer": True
+    }
+
+    if dtype == torch.float16:
+        config_dict["fp16"] = {"enabled": True, "initial_scale_power": 8}
+    elif dtype == torch.bfloat16:
+        config_dict["bf16"] = {"enabled": True}
+
+    model = SimpleModel(args.hidden_dim, nlayers=args.nlayers)
+    run_model(model, config_dict, args.hidden_dim, dtype,
+                pin_memory, topk_ratio, update_interval, overlap_step,
+                total_iteration)
+
+
+if __name__ == "__main__":
+    main()