
Per Layer Streaming Quantization #655

@msaroufim

Description

A tradeoff users have often complained about, most recently @aredden, is that they either:

  1. quantize on CPU and then push the model to GPU -> slow quantization but VRAM efficient
  2. push the model to GPU and then quantize on GPU -> fast quantization but needs lots of VRAM

Instead, we could have a utility that sends one layer at a time to the GPU, quantizes it there, and then streams in the next layer synchronously (sketched below). Granted, this workflow seems to interact in a clunky way with torch.compile, where we don't compile things layer-wise and generally expect the model to already be on the device where it's compiled.
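A minimal sketch of what such a utility could look like, under stated assumptions: `quantize_streaming` and its `quantize_fn` argument are hypothetical names introduced here for illustration, and the in-place quantization call is assumed to be something like torchao's `quantize_` applied to a single module.

```python
import torch
import torch.nn as nn

def quantize_streaming(model: nn.Module, quantize_fn, device: str = "cuda") -> nn.Module:
    """Quantize a CPU-resident model one leaf module at a time on `device`.

    Each leaf layer is moved to the GPU, quantized there (the fast path),
    and the now-smaller quantized weights stay on the GPU. Peak VRAM is
    roughly the quantized model so far plus one full-precision layer,
    rather than the whole full-precision model.
    """
    for child in model.children():
        if list(child.children()):
            # Container module: recurse so only leaf layers get shipped.
            quantize_streaming(child, quantize_fn, device)
        else:
            child.to(device)          # move this layer's weights to the GPU
            quantize_fn(child)        # quantize in place while it's on the GPU
            torch.cuda.synchronize()  # stream layers through one at a time
    return model

# Hypothetical usage with torchao's int8 weight-only config:
#   from torchao.quantization import quantize_, int8_weight_only
#   model = quantize_streaming(model, lambda m: quantize_(m, int8_weight_only()))
```

This keeps each quantized layer on the GPU rather than round-tripping it back to CPU, so the VRAM high-water mark is bounded by one full-precision layer plus the already-quantized prefix of the model.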
