Closed
Description
Add Q2_0 and Q2_1 quantization support to ggml:

- Follow the existing Q4_0 and Q4_1 implementations
- Implement reference scalar quantization and dequantization routines (a rough scalar sketch follows after this description)
- I suspect we might have to use QK == 16 in this case to compensate for further accuracy losses
- Add SIMD support for a specific architecture - investigate best strategy to perform the ggml_vec_dot_q2() computation
- No need to implement ggml_vec_mad_q2() - these will be deprecated soon
- Compute perplexity scores

The expected model sizes for 7B and QK == 16 are:

- Q2_0 - 3.2 GB

For QK == 32 we have:

- Q2_0 - 2.4 GB
- Q2_1 - 3.2 GB
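Back-of-the-envelope, these numbers follow from the bits per weight, assuming one fp32 scale per block (plus one fp32 min per block for Q2_1):

- Q2_0, QK == 16: (16*2 + 32) bits per 16 weights = 4 bits/weight -> 7B * 4 / 8 bytes ~ 3.2 GB
- Q2_0, QK == 32: (32*2 + 32) bits per 32 weights = 3 bits/weight -> 7B * 3 / 8 bytes ~ 2.4 GB
- Q2_1, QK == 32: (32*2 + 2*32) bits per 32 weights = 4 bits/weight -> ~ 3.2 GB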
Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The efforts needed to add this support are so small that there is no reason not to do it.
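For illustration, here is a minimal scalar sketch of what the reference routines could look like, following the structure of the existing Q4_0/Q4_1 reference code. The block_q2_0/block_q2_1 layouts, the QK == 16 value, and the function names are assumptions made for the sketch, not actual ggml code:

```c
// Minimal sketch only - hypothetical layouts and names, not the ggml implementation.
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define QK 16

typedef struct {
    float   d;            // scale
    uint8_t qs[QK / 4];   // 2-bit quants, 4 per byte
} block_q2_0;

typedef struct {
    float   d;            // scale
    float   m;            // min
    uint8_t qs[QK / 4];   // 2-bit quants, 4 per byte
} block_q2_1;

// Q2_1 reference: per block, map [min, max] onto the 4 levels 0..3 (Q4_1 does the same with 16 levels)
static void quantize_row_q2_1_reference(const float * x, block_q2_1 * y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        float min = x[i*QK];
        float max = x[i*QK];
        for (int l = 1; l < QK; l++) {
            if (x[i*QK + l] < min) min = x[i*QK + l];
            if (x[i*QK + l] > max) max = x[i*QK + l];
        }

        const float d  = (max - min) / 3.0f;          // 3 = 2^2 - 1 quantization steps
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        y[i].m = min;

        for (int l = 0; l < QK; l += 4) {
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                int q = (int) roundf((x[i*QK + l + j] - min) * id);
                if (q < 0) q = 0;
                if (q > 3) q = 3;
                b |= (uint8_t) q << (2*j);            // pack 4 x 2-bit quants per byte
            }
            y[i].qs[l/4] = b;
        }
    }
}

static void dequantize_row_q2_1(const block_q2_1 * x, float * y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        for (int l = 0; l < QK; l += 4) {
            for (int j = 0; j < 4; j++) {
                const int q = (x[i].qs[l/4] >> (2*j)) & 0x3;
                y[i*QK + l + j] = q * x[i].d + x[i].m;  // reconstruct: q * scale + min
            }
        }
    }
}
```

Q2_0 would be the same minus the per-block min (quantizing around zero), and ggml_vec_dot_q2() could reuse the same unpacking loop over two quantized rows.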
Activity
dakennedyd commented on Mar 24, 2023
No 3-bit support?
ggerganov commented on Mar 24, 2023
I don't think I can implement it efficiently, but if anyone wants to give it a try - sure
Green-Sky commented on Mar 24, 2023
65B using 32gig ram anyone? 😆
prusnak commented on Mar 24, 2023
I came up with a script that's able to compute RMS for various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072
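Since the same check might be handy inside ggml itself, here is the core idea in C (this is not the script above, which is Python, and the function name is made up): quantize a row, dequantize it back, and compute the RMS of the round-trip error.

```c
// RMS of the quantization round-trip error - minimal sketch, not prusnak's script.
#include <math.h>
#include <stddef.h>

// x: original weights, y: result of quantize followed by dequantize, n: element count
static double rms_error(const float * x, const float * y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double diff = (double) x[i] - (double) y[i];
        sum += diff * diff;
    }
    return sqrt(sum / (double) n);
}
```

Comparing this number across quantization variants gives a quick proxy before running full perplexity.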
sw commented on Mar 25, 2023
Go home Q2, you're drunk ;-)
This is cherry-picked; often it goes to babbling numbers right away.
Q3 seems decent:
Both are very slow because I haven't found a good way to use AVX2 yet. Perplexity would probably take days if not weeks.
I used float for the scale in Q2 and FP16 in Q3, so the model files actually are the same size:
For Q2 I deviated slightly from the standard calculation of the factors. If you want a zero value and symmetry in the positive and negative range, that would have left only 3 values (-1, 0, +1). Instead, I calculate the signed maximum (= the value of largest magnitude, without applying fabsf), then assign the value -2 to that maximum. The sign of the shared scaling factor is adjusted to give the right sign of the result. Without this modification, I couldn't get Q2 to output any semblance of English.

Code here: https://github.com/sw/llama.cpp/tree/q2q3
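A minimal sketch of that scale choice (not the code from the linked branch; the helper name is made up):

```c
// Pick the scale so that the element with the largest magnitude (sign kept) maps exactly
// to -2, which lets Q2 use all four levels {-2, -1, 0, +1}. Sketch only.
#include <math.h>

static float q2_scale_signed_max(const float * x, int n) {
    float max = 0.0f;                // signed maximum: largest |x[i]|, sign preserved
    for (int i = 0; i < n; i++) {
        if (fabsf(x[i]) > fabsf(max)) {
            max = x[i];
        }
    }
    return max / -2.0f;              // the sign of the scale carries the sign of the extreme value
}

// quantize: q = clamp(round(x[i] / d), -2, +1); dequantize: x[i] is reconstructed as q * d
```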
sw commented on Mar 27, 2023
Updated my branch with AVX optimizations, probably far from perfect.
Still quite slow...
Q2:
Q3:
CamiloMM commented on Mar 31, 2023
Not nearly enough, we need support for 1-bit signed floats.
Interpause commented on Apr 2, 2023
Swap that out for 1 qubit and now we're talking.
prusnak commented on Apr 2, 2023
I think the best model size and performance will be achieved when 0-bit quantization is used.
Lolagatorade commented on Apr 12, 2023
Mhmm possibly -1...
ggerganov commented on Jun 24, 2023
Thanks to K-quants this is now available
MrMage commented on Jun 26, 2023
Have there been any new insights into the quality of 2-bit quantization? I.e. does that approach produce reasonable results now?
Green-Sky commented on Jun 26, 2023
@MrMage pure q2 will never be good, but the k-quants use a mixture with some q2 to achieve reasonable results. Check out how LLAMA_FTYPE_MOSTLY_Q2_K is composed here: #1684

neelr commented on Nov 17, 2023
https://arxiv.org/abs/2307.13304