From our README.md:

> torchao is a library to create and integrate high-performance custom data types and layouts into your PyTorch workflows
So far we've done a good job building out the primitive data types along with their corresponding transformed Linear layers: given a new `ExoticDtype()`, we have a playbook for creating `ExoticDtypeLinear()`. For weight-only transformations this is a perfectly fine workflow, and it's how the majority of quantization libraries operate.
For example:

```python
from torchao.quantization import quantize_, int4_weight_only

m = DownloadModelFromHuggingFace()  # placeholder: load any model
quantize_(m, int4_weight_only())  # swaps out every torch.nn.Linear for a 4-bit weight-only Linear
```
We can make the above shine with more accessible blogs, performance benchmarks, and integrations with more partners.

However, this does somewhat of a disservice to explaining the ao value proposition: we're a dtype library, not a dtype Linear library, so given a dtype it should be easy for us to do a lot more. Some examples I'd like to see next are:
- Quantized optimizers, the most obvious additions being 8-bit and 4-bit Adam
- Quantized KV cache
- Quantization-aware training with an exotic dtype (a minimal sketch follows this list)
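
To make the last bullet concrete, here is a minimal sketch of the fake-quantization trick that quantization-aware training is usually built on. `fake_quantize` and `QATLinear` are hypothetical names for illustration, not existing torchao APIs:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric per-tensor fake quantization: round to the low-bit grid on
    # the forward pass, let gradients flow through unchanged on the backward
    # pass (straight-through estimator).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp(min=1e-12) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()

class QATLinear(torch.nn.Linear):
    # Trains against the quantized weight, so the model learns weights that
    # survive the eventual post-training conversion to the exotic dtype.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, fake_quantize(self.weight), self.bias)
```

The point is that once a dtype defines its quantize/dequantize pair, the same pair should drop into optimizers, KV caches, and QAT, not just a Linear swap.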
None of the above is "research"; this is very much the direction inference engineering is moving: https://blog.character.ai/optimizing-ai-inference-at-character-ai/
Also, given an exotic quantization scheme, I'd like us to be more proactive in helping people benchmark their models, so this should include:
- FLOP utilization (see the sketch after this list)
- Memory bandwidth
- Cache hit rate (for KV cache only)
- Roofline analysis
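
As a starting point, here's a minimal sketch of the kind of measurement such tooling could automate. `matmul_tflops` is a hypothetical helper, not an existing torchao API, and it assumes a CUDA device:

```python
import torch
from torch.utils.benchmark import Timer

def matmul_tflops(m: int, k: int, n: int, dtype=torch.bfloat16) -> float:
    # Measure achieved TFLOP/s for an (m, k) @ (k, n) matmul; dividing by the
    # device's peak gives the FLOP-utilization number from the list above.
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    t = Timer(stmt="a @ b", globals={"a": a, "b": b}).blocked_autorange()
    return 2 * m * k * n / t.median / 1e12

print(matmul_tflops(4096, 4096, 4096))
```

The same pattern with bytes moved instead of FLOPs yields achieved memory bandwidth, and plotting both against device peaks is exactly the roofline analysis in the last bullet.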