
Per Layer Streaming Quantization #655

@msaroufim

Description

A tradeoff users have often complained about, most recently @aredden, is that they either:

  1. quantize on CPU and then push the model to GPU -> slow quantization but VRAM efficient
  2. push the model to GPU and then quantize on GPU -> fast quantization but needs lots of VRAM

Instead, we could have a utility that sends one layer at a time to the GPU, quantizes it there, and then streams in the next layer synchronously (sketched below). Granted, this workflow seems to interact in a clunky way with torch.compile, where we don't compile things layer-wise and generally expect the model to already be on the device where it's compiled.
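A minimal sketch of what such a utility could look like, under stated assumptions: `quantize_streaming` and its `quantize_fn` argument are hypothetical names introduced here for illustration, and the in-place quantization call is assumed to be something like torchao's `quantize_` applied to a single module.

```python
import torch
import torch.nn as nn

def quantize_streaming(model: nn.Module, quantize_fn, device: str = "cuda") -> nn.Module:
    """Quantize a CPU-resident model one leaf module at a time on `device`.

    Each leaf layer is moved to the GPU, quantized there (the fast path),
    and the now-smaller quantized weights stay on the GPU. Peak VRAM is
    roughly the quantized model so far plus one full-precision layer,
    rather than the whole full-precision model.
    """
    for child in model.children():
        if list(child.children()):
            # Container module: recurse so only leaf layers get shipped.
            quantize_streaming(child, quantize_fn, device)
        else:
            child.to(device)          # move this layer's weights to the GPU
            quantize_fn(child)        # quantize in place while it's on the GPU
            torch.cuda.synchronize()  # stream layers through one at a time
    return model

# Hypothetical usage with torchao's int8 weight-only config:
#   from torchao.quantization import quantize_, int8_weight_only
#   model = quantize_streaming(model, lambda m: quantize_(m, int8_weight_only()))
```

This keeps each quantized layer on the GPU rather than round-tripping it back to CPU, so the VRAM high-water mark is bounded by one full-precision layer plus the already-quantized prefix of the model.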
