
Conversation


@tohskai tohskai commented Sep 21, 2025


meta-cla bot commented Sep 21, 2025

Hi @tohskai!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Sep 21, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed label Sep 21, 2025
Contributor

@wwwjn wwwjn left a comment


I read the blog and the memory budget idea is cool. Have you tried out the implementation on some model (e.g., llama3) with torch.compile? I'm curious whether it works end to end and whether the performance is better.

@tohskai
Author

tohskai commented Sep 22, 2025

I read the blog and the memory budget idea is cool. Have you tried out the implementation on some model (e.g., llama3) with torch.compile? I'm curious whether it works end to end and whether the performance is better.

I haven't done runs on llama3, but in our benchmarks it showed significant improvements over regular SAC. This is why I wanted to upstream this :)
[benchmark screenshots]

But our model is quite different, so it's totally reasonable to see smaller gains.

@tianyu-l tianyu-l requested a review from soulitzer September 22, 2025 21:19
@wwwjn
Contributor

wwwjn commented Sep 22, 2025

Thanks for sharing! We would love to see more verification - e.g., correctness and loss curves, and performance analysis on titan-supported models (llama3, etc.).

cc @soulitzer for reviewing

Contributor

@tianyu-l tianyu-l left a comment


I haven't done runs on llama3, but in our benchmarks it showed significant improvements over regular SAC. This is why I wanted to upstream this :)

I agree with @wwwjn that

We would love to see more verification - e.g., correctness and loss curves, and performance analysis on titan-supported models (llama3, etc.).

Please refer to https://github.com/pytorch/torchtitan/blob/main/CONTRIBUTING.md#proof-of-value

@tohskai
Author

tohskai commented Oct 1, 2025

@wwwjn @soulitzer @tianyu-l

Should this support selection of activation_memory_budget_solver and activation_memory_budget_runtime_estimator? What about visualize_memory_budget_pareto? I found them useful and their discoverability is low, but given that this is an unstable API and potentially feature overload, I would prefer to hear your opinion.

https://github.com/pytorch/pytorch/blob/main/torch/_functorch/config.py#L147-L169
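
For context on what those knobs do, here is a minimal sketch of setting them directly on torch._functorch.config before compiling; the toy model and the particular values are illustrative only, not code from this PR:

import torch
import torch._functorch.config as functorch_config

# Example values only; see torch/_functorch/config.py for the full docs.
functorch_config.activation_memory_budget = 0.5  # 1.0 = default partitioning, 0.0 = full-AC memory target
functorch_config.activation_memory_budget_runtime_estimator = "flops"  # or "profile"
functorch_config.activation_memory_budget_solver = "dp"  # or "greedy" / "ilp" (needs scipy)
functorch_config.visualize_memory_budget_pareto = True  # dump the runtime-vs-memory pareto SVG

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
).cuda()
compiled = torch.compile(model)

x = torch.randn(8, 1024, device="cuda", requires_grad=True)
compiled(x).sum().backward()  # recompute decisions for the compiled fwd/bwd are made under the budget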

@tohskai
Author

tohskai commented Oct 1, 2025

I ran models/llama3/train_configs/llama3_8b.toml on 8xH100:

[run result screenshots]

Would that suffice for Proof of Value?

@tohskai
Author

tohskai commented Oct 6, 2025

I addressed your comments, rebased to avoid conflicts, and added the other parts of the API, but I am totally okay with reverting that; just waiting for your opinion.

@tohskai tohskai requested a review from tianyu-l October 7, 2025 09:12
Contributor

@tianyu-l tianyu-l left a comment


Thank you! Sounds good in general. Left some comments. Please see if they make sense.

    )
    model.layers.register_module(layer_id, transformer_block)
if ac_config.mode == "memory_budget":
    assert (model_compile_enabled is True), "Memory budget mode requires model to be compiled"

Suggested change
assert (model_compile_enabled is True), "Memory budget mode requires model to be compiled"
assert model_compile_enabled, "Memory budget mode requires model to be compiled"

Comment on lines +577 to +590
activation_memory_budget_runtime_estimator: Literal["flops", "profile"] = "flops"
"""
This controls how we estimate the runtime when deciding which operators are
the cheapest to recompute. The two options are
"flops": bases it on the flop count provided by torch.utils.flop_counter
"profile": benchmarks each operator to come up with a runtime
"""

activation_memory_budget_solver: Literal["dp", "greedy", "ilp"] = "dp"
"""
This controls the solver used for the 0-1 knapsack. By default we use a
quantized DP solution ("dp"). The other approaches are "greedy" and "ilp"
(which has a scipy dependency).
"""

It's hard to tell if users should change them. Maybe let's remove them for now and see if people have complaints.

quantized DP solution ("dp"). The other approaches are "greedy" and "ilp"
(which has a scipy dependency).
"""
visualize_memory_budget_pareto: bool = False

This one I'm not sure about. It could be useful, because otherwise people have no idea what value they should set the budget to.
Btw, the picture you linked doesn't seem to be in "increments of 0.5". It looks like 0.05.

Whether to stop recomputing early when all activations have already been
rematerialized.
"""
activation_memory_budget: float = 1.0

Maybe set 0.5 by default, o/w if a user turns on memory_budget without tuning this, nothing will happen.

This dumps out a SVG visualization of the expected runtime vs. activation
memory tradeoffs for all memory budget values from 0 to 1 in increments of
0.5. See an example here:
https://github.com/pytorch/pytorch/pull/126320#discussion_r1625104015

Please describe what folder it'll dump into.

Whether to stop recomputing early when all activations have already been
rematerialized.
"""
activation_memory_budget: float = 1.0

nit: please add a space between configs, o/w it's hard to tell whether a message is associated with the config above or below.

Whether to stop recomputing early when all activations have already been
rematerialized.
"""
activation_memory_budget: float = 1.0

activation_ is redundant, so we can just call it

Suggested change
activation_memory_budget: float = 1.0
memory_budget: float = 1.0

0.0 corresponds to the activation memory from applying
activation checkpointing to the full compiled region, and 1.0 corresponds to
the activation memory from the default runtime-optimized strategy.
"""