
CPUOffloadOptimizer issues #1209

Open

Description

@felipemello1

Hi all, I was giving the CPUOffloadOptimizer a try and found two issues when using it with QLoRA single device in torchtune:

  1. When using an LR scheduler, I got the error below. Maybe there is a way to inherit from the optimizer class?
File "/data/users/felipemello/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup
    return LambdaLR(optimizer, lr_lambda, last_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
    super().__init__(optimizer, last_epoch, verbose)
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
    raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
TypeError: CPUOffloadOptimizer is not an Optimizer
  2. When passing model.parameters(), I got the error below. I imagine a simple fix is to keep only the params that require grad, like the AdamW implementation does (see the workaround sketch after the traceback).
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/cpu_offload.py", line 76, in __init__
    p_cuda.register_post_accumulate_grad_hook(backward_hook)
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/_tensor.py", line 678, in register_post_accumulate_grad_hook
    raise RuntimeError(
RuntimeError: cannot register a hook on a tensor that doesn't require gradient
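
In the meantime, a user-side workaround seems to be to filter the params before handing them to the optimizer. A minimal sketch (here model stands in for the torchtune QLoRA model, and the LR is made up):

import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Pass only the trainable (e.g. LoRA) parameters so no post-accumulate-grad
# hook gets registered on tensors with requires_grad=False.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = CPUOffloadOptimizer(trainable_params, torch.optim.AdamW, lr=3e-4)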

cc: @gau-nernst

Activity

gau-nernst (Collaborator) commented on Nov 1, 2024

1 is a known issue. You can see my view here: #959 (comment). I will look into the torch.optim.Optimizer base class to see what could go wrong if I make CPUOffloadOptimizer inherit from it. For example, off the top of my head, CPUOffloadOptimizer will not have self.state.

In the meantime, CPUOffloadOptimizer requires setting the LR manually: #584 (comment).
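
Roughly, the manual LR update looks like the sketch below. This assumes CPUOffloadOptimizer exposes param_groups like a regular optimizer (if your version does not, you would have to update the LR on the wrapped per-parameter optimizers instead); the schedule function and toy model are made up for illustration:

import math
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Toy model just to show the training loop; QLoRA/torchtune details omitted.
model = torch.nn.Linear(16, 16, device="cuda")

def lr_at(step, base_lr=1e-4, warmup=10, total=100):
    # hypothetical warmup + cosine schedule, standing in for
    # get_cosine_schedule_with_warmup
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, lr=lr_at(0))

for step in range(100):
    out = model(torch.randn(4, 16, device="cuda"))
    out.pow(2).mean().backward()
    # update the LR by hand instead of using a torch LR scheduler
    for group in optim.param_groups:  # assumes param_groups is exposed
        group["lr"] = lr_at(step)
    optim.step()
    optim.zero_grad()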

For 2, it's an oversight on my part. We can simply add a requires_grad check here. Will push a fix.

for p_cuda in params:
    # pre-allocate CPU params and grads
    p_cpu = torch.empty_like(p_cuda, device="cpu", pin_memory=True)
    p_cpu.grad = torch.empty_like(p_cpu, pin_memory=True)

    p_cpu.copy_(p_cuda.detach(), non_blocking=True)
    self.param_cuda2cpu_map[p_cuda] = p_cpu

    p_cuda.register_post_accumulate_grad_hook(backward_hook)
    self.optim_dict[p_cuda] = optimizer_class([{"params": p_cpu, **param_group}], **kwargs)
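
For reference, the fix would be roughly the following guard at the top of that loop (a sketch, not necessarily the exact patch):

for p_cuda in params:
    # skip frozen params (e.g. the frozen base weights in QLoRA) so we never
    # try to register a post-accumulate-grad hook on them
    if not p_cuda.requires_grad:
        continue
    # ... rest of the loop unchanged ...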

fzyzcjy commented on Nov 18, 2024

Hi, are there any updates? Thanks! It would be great if it could be plugged directly into Hugging Face transformers, but right now it fails with the scheduler issue above:

[10:19:58.912]:     self.trainer.inner.train()
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
[10:19:58.912]:     output = super().train(*args, **kwargs)
[10:19:58.912]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[10:19:58.912]:     return inner_training_loop(
[10:19:58.912]:            ^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2224, in _inner_training_loop
[10:19:58.912]:     self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1130, in create_optimizer_and_scheduler
[10:19:58.912]:     self.create_scheduler(num_training_steps=num_training_steps, optimizer=optimizer)
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1632, in create_scheduler
[10:19:58.912]:     self.lr_scheduler = get_scheduler(
[10:19:58.912]:                         ^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 550, in get_scheduler
[10:19:58.913]:     return schedule_func(
[10:19:58.913]:            ^^^^^^^^^^^^^^
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 132, in get_linear_schedule_with_warmup
[10:19:58.913]:     return LambdaLR(optimizer, lr_lambda, last_epoch)
[10:19:58.913]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
[10:19:58.913]:     super().__init__(optimizer, last_epoch, verbose)
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
[10:19:58.913]:     raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
[10:19:58.913]: TypeError: CPUOffloadOptimizer is not an Optimizer

gau-nernst (Collaborator) commented on Nov 19, 2024

@fzyzcjy To unblock your case, you can try making CPUOffloadOptimizer a subclass of torch.optim.Optimizer, i.e. change the following line

class CPUOffloadOptimizer:

to class CPUOffloadOptimizer(Optimizer):. Make sure not to call super().__init__(); this is just a workaround to pass the class check in the PyTorch LR scheduler. I will investigate whether this causes other issues before merging the fix.
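
Concretely, the local edit in torchao/prototype/low_bit_optim/cpu_offload.py would look roughly like this (a workaround sketch only; the __init__ signature is abbreviated and its body stays unchanged):

import torch
from torch.optim import Optimizer

class CPUOffloadOptimizer(Optimizer):  # was: class CPUOffloadOptimizer:
    def __init__(self, params, optimizer_class=torch.optim.AdamW, **kwargs):
        # Deliberately do NOT call Optimizer.__init__() here; inheriting is only
        # a workaround to pass the isinstance check in torch.optim.lr_scheduler.
        ...  # existing __init__ body unchanged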

IMO, since Python uses duck typing, the PyTorch LR scheduler should not explicitly check for the optimizer class.

fzyzcjy commented on Nov 19, 2024

Thank you!
