Description
Hi! I am trying to reproduce the benchmark results using torchao/_models/llama/generate.py, but I cannot benchmark the quantized model successfully. Specifically, with a torch version < 2.5.0, I get the following error:
```
  File "/mnt/workspace/Lumina-mGPT/torchao_benchmark.py", line 310, in main
    unwrap_tensor_subclass(model)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 287, in unwrap_tensor_subclass
    unwrap_tensor_subclass(child)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 287, in unwrap_tensor_subclass
    unwrap_tensor_subclass(child)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 287, in unwrap_tensor_subclass
    unwrap_tensor_subclass(child)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 286, in unwrap_tensor_subclass
    parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass())
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torch/nn/utils/parametrize.py", line 562, in register_parametrization
    parametrizations = ParametrizationList([parametrization], original, unsafe=unsafe)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torch/nn/utils/parametrize.py", line 173, in __init__
    original_i = Parameter(original_i)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torch/nn/parameter.py", line 40, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients
```
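I considered gating the call on the installed torch version, since (as far as I understand from the torchao README) `unwrap_tensor_subclass` is only needed before `torch.compile` on torch < 2.5. A minimal sketch of what I mean — `needs_unwrap` is a helper I made up for illustration, not a torchao API:

```python
# Hypothetical workaround sketch: only call unwrap_tensor_subclass on
# torch < 2.5, where it is still required before torch.compile.
# needs_unwrap is a local helper for illustration, not part of torchao.

def needs_unwrap(torch_version: str) -> bool:
    # Parse e.g. "2.4.1+cu121" -> (2, 4), ignoring any local version suffix.
    major, minor = (int(p.split("+")[0]) for p in torch_version.split(".")[:2])
    return (major, minor) < (2, 5)

# Intended usage (assumes torch and torchao are installed):
# import torch
# from torchao.utils import unwrap_tensor_subclass
# if needs_unwrap(torch.__version__):
#     unwrap_tensor_subclass(model)
```

But I am not sure this is the intended fix, which is why I tried upgrading instead.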
After upgrading torch to 2.5.0, the process gets stuck and stops responding for a very long time:
```
Using device=cuda
Loading model ...
Time to load model: 54.85 seconds
Compiling Model
^C^C^C^C^C^C
```
I do not see any CPU usage with the `top` command, and I have to kill the process by its PID.
Also, is there any way to accelerate a Hugging Face model by quantizing it with torchao, without converting the model format?
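To make the last question concrete, something along these lines is what I have in mind — a sketch assuming torchao's `quantize_` / `int8_weight_only` API; the checkpoint name is just an example:

```python
# Sketch (untested): quantize a transformers model in place with torchao,
# with no checkpoint format conversion. Assumes a recent torchao and a CUDA
# GPU; the model name below is illustrative only.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
quantize_(model, int8_weight_only())  # swap Linear weights for int8 subclasses
model = torch.compile(model, mode="max-autotune")
```

Is this the supported path, or is generate.py's model format required for the published benchmark numbers?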