Support QuaRot quantization scheme

A new quantization scheme was recently published that not only reduces memory consumption (like current quantization schemes) but also reduces computation.

> **[QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456)**
> We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4-bits, without any channels identified for retention in higher precision. Our quantized LLaMa2-70B model has losses of at most 0.29 WikiText-2 perplexity and retains 99% of the zero-shot performance. Code is available at: [https://github.com/spcl/QuaRot](https://github.com/spcl/QuaRot).

I think it would be interesting to see whether this technique, or parts of it, could be adopted in llama.cpp to speed up inference of quantized models.
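
To make the "rotate without changing the output" idea from the abstract concrete, here is a minimal sketch in plain Python. It is not the QuaRot implementation and not llama.cpp code: the helper names (`hadamard`, `row_times_matrix`, and so on), the toy sizes, and the artificial outlier are all assumptions made for illustration. It uses a normalized Hadamard matrix as one convenient orthogonal rotation `Q`, rotates the activations by `Q` and the weights by `Q^T`, and checks that the product is unchanged while the largest activation value shrinks.

```python
import math
import random


def hadamard(n):
    """Normalized Sylvester-type Hadamard matrix; n must be a power of two.

    The result Q is orthogonal: Q @ Q^T = I.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        # H_2k = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def row_times_matrix(x, m):
    """Row vector x (length k) times matrix m (k x j)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def matmul(a, b):
    return [row_times_matrix(row, b) for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


random.seed(0)
n = 8  # toy hidden size (assumption)
q = hadamard(n)

# Toy activations with one artificial outlier channel.
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
x[3] = 50.0

# Toy weight matrix (n x 4).
w = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(n)]

y_ref = row_times_matrix(x, w)          # x @ W
x_rot = row_times_matrix(x, q)          # x @ Q
w_rot = matmul(transpose(q), w)         # Q^T @ W
y_rot = row_times_matrix(x_rot, w_rot)  # (x @ Q) @ (Q^T @ W) = x @ W

max_err = max(abs(a - b) for a, b in zip(y_ref, y_rot))
peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in x_rot)

print("max output difference:", max_err)
print("largest |activation| before rotation:", peak_before)
print("largest |activation| after rotation:", peak_after)
```

Because `Q` is orthogonal, the rotation cancels out in the product, so only floating-point rounding separates `y_ref` from `y_rot`. The outlier's magnitude is spread across all channels, which is the property that should make low-bit quantization of the rotated activations easier. The sketch does not quantize anything, so it says nothing about accuracy or speed in practice.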

Support QuaRot quantization scheme #6444
