In torchtune, you can't resume from a checkpoint when using a torchao low-bit optimizer:
File "/data/users/felipemello/torchtune/torchtune/training/checkpointing/_utils.py", line 249, in safe_torch_load
state_dict = torch.load(
^^^^^^^^^^^
File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/serialization.py", line 1486, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torchao.prototype.low_bit_optim.subclass_8bit.OptimState8bit was not an allowed global by default. Please use `torch.serialization.add_safe_globals([OptimState8bit])` or the `torch.serialization.safe_globals([OptimState8bit])` context manager to allowlist this global if you trust this class/function.
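
As the error message suggests, one workaround is to allowlist the torchao optimizer-state class around the load. A minimal sketch, assuming the import path shown in the traceback; the checkpoint path below is only a placeholder for whichever file fails to load:

```python
import torch
from torchao.prototype.low_bit_optim.subclass_8bit import OptimState8bit

# Allowlist the torchao 8-bit optimizer state tensor subclass so that
# torch.load(..., weights_only=True) is allowed to unpickle it.
with torch.serialization.safe_globals([OptimState8bit]):
    state_dict = torch.load(
        "/path/to/recipe_state.pt",  # placeholder path for the failing checkpoint
        map_location="cpu",
        weights_only=True,
    )
```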
To reproduce:
tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct --ignore-patterns "original/consolidated.00.pth"
tune run full_finetune_single_device --config llama3_2/1B_full_single_device epochs=2 max_steps_per_epoch=20 optimizer=torchao.prototype.low_bit_optim.AdamW8bit
tune run full_finetune_single_device --config llama3_2/1B_full_single_device epochs=2 max_steps_per_epoch=20 optimizer=torchao.prototype.low_bit_optim.AdamW8bit resume_from_checkpoint=True checkpointer.checkpoint_files=["epoch_0/model-00001-of-00001.safetensors"]
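
Alternatively, the class could be registered once via `torch.serialization.add_safe_globals` (the other option named in the error), for example in `safe_torch_load` before the `torch.load` call. This is only a sketch; where exactly to register it, and whether other torchao optimizer-state subclasses need the same treatment, are assumptions:

```python
import torch

try:
    # Import path taken from the traceback; guard in case torchao is not installed.
    from torchao.prototype.low_bit_optim.subclass_8bit import OptimState8bit

    torch.serialization.add_safe_globals([OptimState8bit])
except ImportError:
    pass  # no torchao available, nothing to allowlist
```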