Labels: good first issue (Good for newcomers)
Description
A tradeoff users have often complained about (most recently @aredden) is that they must either:
- quantize on CPU and then push the model to the GPU -> slow quantization but VRAM efficient
- push the model to the GPU and then quantize on the GPU -> fast quantization but needs lots of VRAM

Instead, we could have a utility that sends one layer at a time to the GPU, quantizes it there, and then sends in the next layer synchronously; see the sketch below. Granted, this workflow seems to interact in a clunky way with torch.compile, where we don't compile things layer-wise and generally expect the model to already be on the device where it's compiled.
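A minimal sketch of what such a utility could look like, assuming a user-supplied `quantize_fn` that quantizes a module in place once it lives on the GPU (the helper name, the callback signature, and the recursion strategy are illustrative, not an existing API):

```python
import torch.nn as nn

def quantize_layerwise(model: nn.Module, quantize_fn, device: str = "cuda") -> None:
    """Move one leaf module at a time to `device` and quantize it there.

    `quantize_fn` is a placeholder for whatever in-place quantization
    routine is used; it is assumed to expect a module already on `device`.
    The quantized (smaller) module is left on the device, so peak VRAM is
    roughly the quantized-so-far model plus one unquantized layer.
    """
    for _, module in model.named_children():
        if len(list(module.children())) > 0:
            # Recurse into containers so only leaf layers get moved.
            quantize_layerwise(module, quantize_fn, device)
        else:
            module.to(device)   # one layer's worth of full-precision VRAM
            quantize_fn(module) # quantize on GPU for speed
```

This would sit between the two extremes above: close to on-GPU quantization speed, with peak VRAM bounded by the quantized model plus a single unquantized layer.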