
Conversation

@tfaod (Contributor) commented on Aug 23, 2025

New Submission

Submission Information

Please fill out the following information about your submission within the quotation marks.

submission_name: "ademamix"
submission_folder: "submissions/self_tuning/ademamix"  
submission authors:
  * authors: "Alice Yang"  # List authors separated by commas
  * affiliations: "Meta Superintelligence Labs, FAIR Team "
algorithm authors:
  * authors: "Matteo Pagliardini, Pierre Ablin, David Grangier"
  * affiliations: "EPFL, Apple"
version: "1.0"  # Optional version number of your submission
ruleset: "self-tuning"
framework: "PyTorch"
description: "AdEMAMix optimizer with optimal hparams"

Evidence for the Submission's Performance

  • See results from the AdEMAMix sweeps below, compared to the baseline sfadamw_v2 and nadamw submissions

Sweep Details

  • AdEMAMix Sweep 1 - compare runs with vs. without warmup for beta3
    • runs without beta3 warmup struggled to hit more than half of the targets, while runs with beta3 warmup consistently hit most or all targets
  • AdEMAMix Sweep 2 - sweep over lr, wd, and alpha (an illustrative sketch of the Sweep 2/3 search spaces follows this list)
    • sweep range:
      • wd: [0, 0.1]
      • lr: [1e-4, 5e-3]
      • alpha: [8, 10]
    • fixed values (AdEMAMix paper defaults):
      • beta1: 0.9
      • beta2: 0.999
      • beta3: 0.9999
      • alpha_warmup: 500000
      • beta_warmup: 500000
  • AdEMAMix Sweep 3 - sweep across betas for the top wd, lr, and alpha values
    • sweep range:
      • beta1: [0.8, 0.99]
      • beta2: [0.95, 0.999]
      • beta3: [0.99, 0.9999]
    • fixed values:
      • wd: 0.1
      • lr: {2e-3, 5e-3}
      • alpha: 8
      • alpha_warmup: 500000
      • beta_warmup: 500000
  • Final Top Values:
    • (incl criteo run on 32gb): all_ademamix_w_sched_alpha_sweep_over_betas_with_criteo_on_32gb_betas1-0.8_2-0.995_30.9995_lr0.002_wd0.1_alpha8
    • (excl criteo): all_ademamix_w_sched_alpha_sweep_over_betas_betas1-0.95_2-0.99_30.9999_lr0.002_wd0.1_alpha8
[Attached images: sweep result plots]
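For readers who want the search spaces above in a machine-readable form, here is a minimal sketch written as plain Python dicts. The dict layout, key names, and the log-scale assumption for the learning rate are illustrative guesses, not the actual tuning config used for these runs.

```python
# Illustrative only: Sweep 2 and Sweep 3 search spaces as plain Python dicts.
# Key names, the log-scale choice for lr, and whether the beta ranges were
# continuous intervals or grid endpoints are assumptions, not the real config.
sweep2 = {
    "search": {
        "weight_decay": [0.0, 0.1],
        "learning_rate": {"min": 1e-4, "max": 5e-3, "scale": "log"},
        "alpha": [8, 10],
    },
    "fixed": {
        "beta1": 0.9, "beta2": 0.999, "beta3": 0.9999,
        "alpha_warmup": 500_000, "beta_warmup": 500_000,
    },
}

sweep3 = {
    "search": {
        "beta1": [0.8, 0.99],
        "beta2": [0.95, 0.999],
        "beta3": [0.99, 0.9999],
    },
    "fixed": {
        "weight_decay": 0.1,
        "learning_rate": [2e-3, 5e-3],  # two fixed candidate values
        "alpha": 8,
        "alpha_warmup": 500_000, "beta_warmup": 500_000,
    },
}
```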

Comments

  • AdEMAMix requires more memory, due to its addition of a third momentum sequence (see the illustrative sketch after this list).
  • The optimizer runs out of memory on the criteo1tb workload with the preset batch size (~262k).
  • We tried the following three remediations:
    • Strategy 1) Sweeping over the batch size revealed that the criteo1tb workload began to hit the target once the batch size was decreased to ~60k.
      • Despite reaching the target on all workloads, the runtime increased so much that the algorithm was no longer competitive.
    • Strategy 2) We implemented and swept across four "memory-safe" variants of the AdEMAMix algorithm.
      • While these variants were able to hit the target on criteo1tb, they were significantly slower on the remaining workloads.
    • Strategy 3) We doubled the available memory by running the criteo1tb workload on 8x 32GB A100 GPUs.
      • We combined these results with those from the remaining workloads, which were run on the standard 16GB GPUs.
      • This configuration significantly outperformed the competitive self-tuning nadamw baseline.
  • We find a significant tradeoff between memory consumption and speed in the AdEMAMix algorithm. We will look into future modifications of AdEMAMix that preserve its competitive speed while reducing memory usage.
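To make the memory point concrete, below is a minimal PyTorch-style sketch of a single AdEMAMix update in the spirit of Pagliardini et al. The function name, state-dict layout, and the linear warmup ramps are simplifying assumptions, not this submission's implementation (the paper's beta3 scheduler in particular is more involved). What it illustrates is that AdEMAMix keeps three EMA state tensors per parameter (fast first moment, slow first moment, second moment) versus Adam's two, i.e. roughly a 50% increase in optimizer state.

```python
import torch

def ademamix_step(param, grad, state, step, *, lr=2e-3,
                  betas=(0.9, 0.999, 0.9999), alpha=8.0,
                  weight_decay=0.1, eps=1e-8,
                  alpha_warmup=500_000, beta_warmup=500_000):
    """Single-tensor AdEMAMix update (illustrative sketch; step starts at 1)."""
    beta1, beta2, beta3_final = betas

    # Placeholder warmup schedules: simple linear ramps for alpha and beta3.
    # (The AdEMAMix paper warms beta3 up so that the EMA half-life grows
    # roughly linearly; this sketch does not reproduce that schedule exactly.)
    alpha_t = alpha * min(1.0, step / alpha_warmup)
    beta3_t = beta1 + (beta3_final - beta1) * min(1.0, step / beta_warmup)

    # Three per-parameter state tensors (Adam keeps only the first and third),
    # hence roughly 1.5x the optimizer memory of Adam/NAdamW.
    exp_avg      = state.setdefault("exp_avg", torch.zeros_like(param))       # fast EMA, beta1
    exp_avg_slow = state.setdefault("exp_avg_slow", torch.zeros_like(param))  # slow EMA, beta3
    exp_avg_sq   = state.setdefault("exp_avg_sq", torch.zeros_like(param))    # 2nd moment, beta2

    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_slow.mul_(beta3_t).add_(grad, alpha=1 - beta3_t)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias-correct the fast EMA and the second moment only
    # (the slow EMA is not bias-corrected in the paper).
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias2).sqrt_().add_(eps)
    numer = exp_avg / bias1 + alpha_t * exp_avg_slow

    # Decoupled (AdamW-style) weight decay.
    param.add_(numer / denom + weight_decay * param, alpha=-lr)


# Usage sketch on a toy tensor:
p, g, opt_state = torch.zeros(4), torch.ones(4), {}
for t in range(1, 4):
    ademamix_step(p, g, opt_state, t)
```

On workloads dominated by very large embedding tables, the extra exp_avg_slow buffer is a substantial absolute cost, which is consistent with the out-of-memory behaviour on criteo1tb described above.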

@tfaod requested a review from a team as a code owner on August 23, 2025 20:00

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@tfaod (Contributor, Author) commented on Aug 23, 2025

@priyakasimbeg @fsschneider The AdEMAMix submission, as requested. The optimal hparams include ogbg.
