
2-bit integer quantization #456

Closed

Description

ggerganov (Member, Author)

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (a rough sketch of what these could look like follows this list)
  • I suspect we might have to use QK == 16 in this case to compensate for further accuracy losses
  • Add SIMD support for a specific architecture - investigate the best strategy to perform the ggml_vec_dot_q2() computation
  • No need to implement ggml_vec_mad_q2() - these will be deprecated soon
  • Compute perplexity scores
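
To make the task concrete, here is a rough sketch of what the Q2_0 block layout and the reference scalar routines could look like if they follow the Q4_0 pattern with QK == 16. The struct name, field names, and rounding choices below are assumptions for illustration, not a final design:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define QK2_0 16  // assumed block size (QK == 16)

// hypothetical block layout, modeled on block_q4_0:
// one fp32 scale + 16 weights at 2 bits each (4 per byte)
typedef struct {
    float   d;             // scale
    uint8_t qs[QK2_0/4];   // 2-bit quants
} block_q2_0;

// reference scalar quantization: map each weight to one of the
// four levels {-2, -1, 0, 1}, stored biased as 0..3
static void quantize_row_q2_0_ref(const float * x, block_q2_0 * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max in this block
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax / 2.0f; // so that -amax maps to level -2
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK2_0; l += 4) {
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                const float v = x[i*QK2_0 + l + j]*id;
                const int   q = (int) fminf(3.0f, fmaxf(0.0f, roundf(v) + 2.0f));
                b |= (uint8_t)(q << (2*j));
            }
            y[i].qs[l/4] = b;
        }
    }
}

// reference scalar dequantization: invert the bias and apply the scale
static void dequantize_row_q2_0_ref(const block_q2_0 * x, float * y, int k) {
    assert(k % QK2_0 == 0);
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        for (int l = 0; l < QK2_0; l++) {
            const int q = (x[i].qs[l/4] >> (2*(l%4))) & 3;
            y[i*QK2_0 + l] = (q - 2)*x[i].d;
        }
    }
}
```

With this layout a block is 4 + 4 = 8 bytes for 16 weights, i.e. 4 bits per weight, which is where the size estimates below come from.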

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
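
For reference, these figures follow from simple arithmetic, assuming one fp32 scale per block for Q2_0 and an fp32 scale plus an fp32 offset for Q2_1: at QK == 16, a Q2_0 block costs 16×2 + 32 = 64 bits, i.e. 4 bits per weight, and ~6.7B weights × 4 bits ≈ 3.2 GB. At QK == 32, Q2_0 costs (32×2 + 32)/32 = 3 bits per weight ≈ 2.4 GB, and Q2_1 costs (32×2 + 2×32)/32 = 4 bits per weight ≈ 3.2 GB.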

Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The efforts needed to add this support are so small that there is no reason not to do it.

Activity

dakennedyd (Contributor) commented on Mar 24, 2023

No 3-bit support?

ggerganov (Member, Author) commented on Mar 24, 2023

> No 3-bit support?

I don't think I can implement it efficiently, but if anyone wants to give it a try - sure

Green-Sky (Collaborator) commented on Mar 24, 2023

65B using 32gig ram anyone? 😆

prusnak (Collaborator) commented on Mar 24, 2023

I came up with a script that's able to compute the RMS error for various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072
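
The core of the experiment is just a round trip: quantize a block, dequantize it, and compare against the originals. A minimal C sketch of that measurement (the linked gist is the real thing; this only illustrates the idea) could be:

```c
#include <math.h>

// root-mean-square error between the original values x and their
// quantize->dequantize round trip xr; lower means the quantization
// scheme preserves more of the signal
static float rms_error(const float * x, const float * xr, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        const double e = (double) x[i] - (double) xr[i];
        sum += e*e;
    }
    return (float) sqrt(sum / n);
}
```

Fed with, for example, the quantize/dequantize sketch from the issue description, this gives a quick per-scheme quality number without running a full model.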

sw (Contributor) commented on Mar 25, 2023

Go home Q2, you're drunk ;-)

$ ./main -m ./models/7B/ggml-model-q2_0.bin -p "The efforts needed to add this support are so small that there is no reason not to do it." -n 64 -s 1679735763

The efforts needed to add this support are so small that there is no reason not to do it.
The efforts that we need the work to make sure that we can be sure that everything falls together with no additional and very little is reserved for a little or 1, or even less or 13 is 13, that in additionally or 1 month faster is 18 and or even faster

This is cherry-picked; often it goes to babbling numbers right away.

Q3 seems decent:

$ ./main -m ./models/7B/ggml-model-q3_0.bin -p "Building a website can be done in 10 simple steps:" -n 128 -s 1679739910

Building a website can be done in 10 simple steps:
Decide which web authoring software you're going to use.
Read up on what you need for the site you're building. Note that I am only referring to reading material on the web here; reading will build your knowledge without spending money on a book (or e-book). I would suggest looking into JavaScript, HTML5 and CSS3 before you launch into development of any kind. You can always test the waters of what you're working with against an online validator before you launch into production mode -- or you could just skip that part altogether until you get frustrated with having to use a browser

Both are very slow because I haven't found a good way to use AVX2 yet. Perplexity would probably take days if not weeks.
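
For context, a plain scalar version of the Q2 dot product (the baseline any AVX2 kernel has to beat) might look roughly like this, reusing the hypothetical block_q2_0 layout sketched in the issue description; the code in the linked branch differs in its details:

```c
// scalar reference for the Q2_0 dot product: unpack the biased 2-bit
// quants from both operands, accumulate the integer products, then
// apply both block scales once per block
static float vec_dot_q2_0_scalar(int n, const block_q2_0 * x, const block_q2_0 * y) {
    const int nb = n / QK2_0;
    float sum = 0.0f;
    for (int i = 0; i < nb; i++) {
        int isum = 0;
        for (int l = 0; l < QK2_0; l++) {
            const int qx = ((x[i].qs[l/4] >> (2*(l%4))) & 3) - 2;
            const int qy = ((y[i].qs[l/4] >> (2*(l%4))) & 3) - 2;
            isum += qx*qy;
        }
        sum += x[i].d * y[i].d * (float) isum;
    }
    return sum;
}
```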

I used float for the scale in Q2 and FP16 in Q3, so the model files are actually the same size:

$ ls -gho models/7B/*q*
-rw-rw-r-- 1 3.2G Mär 25 10:43 models/7B/ggml-model-q2_0.bin
-rw-rw-r-- 1 3.2G Mär 25 10:45 models/7B/ggml-model-q3_0.bin
-rw-rw-r-- 1 4.0G Mär 24 11:52 models/7B/ggml-model-q4_0.bin
-rw-rw-r-- 1 4.8G Mär 22 13:08 models/7B/ggml-model-q4_1.bin

For Q2 I deviated slightly from the standard calculation of the scaling factors. Requiring a zero value and symmetry between the positive and negative ranges would leave only 3 usable values (-1, 0, +1). Instead, I calculate the signed maximum (the value of largest magnitude, without applying fabsf) and assign the quantized value -2 to that maximum; the sign of the shared scaling factor is adjusted to give the right sign in the result. Without this modification, I couldn't get Q2 to output any semblance of English.

Code here: https://github.com/sw/llama.cpp/tree/q2q3
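
In rough C, the signed-maximum trick described above might look like this (a sketch reconstructed from the prose; the actual code in the linked branch may differ):

```c
#include <math.h>
#include <stdint.h>

// find the signed maximum: the element with the largest magnitude,
// with its sign preserved (no fabsf applied to the returned value)
static float signed_max(const float * x, int n) {
    float m = x[0];
    for (int i = 1; i < n; i++) {
        if (fabsf(x[i]) > fabsf(m)) m = x[i];
    }
    return m;
}

// quantize one block to levels {-2, -1, 0, 1}: the signed maximum is
// mapped exactly to -2, and the scale d carries the compensating sign,
// so all four 2-bit levels are used instead of a symmetric -1..+1
static void quantize_block_q2_signed(const float * x, int8_t * q, float * d, int n) {
    const float smax = signed_max(x, n);
    *d = -smax / 2.0f;                     // by construction: -2 * d == smax
    const float id = (*d != 0.0f) ? 1.0f / *d : 0.0f;
    for (int i = 0; i < n; i++) {
        int v = (int) roundf(x[i] * id);
        if (v < -2) v = -2;
        if (v >  1) v =  1;
        q[i] = (int8_t) v;                 // reconstruct as q[i] * (*d)
    }
}
```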

sw (Contributor) commented on Mar 27, 2023

Updated my branch with AVX optimizations, probably far from perfect.

Still quite slow...
Q2:

98.37 seconds per pass - ETA 17.90 hours
[1]147.6625,[2]136.8862,[3]132.6015,[4]127.8629,[5]120.4091,[6]111.7640,[7]114.2548,[8]112.8951,

Q3:

203.61 seconds per pass - ETA 37.05 hours
[1]7.0481,[2]8.0335,[3]8.8317,[4]10.0700,[5]10.1138,[6]9.9850,[7]10.2314,[8]10.2057,

CamiloMM commented on Mar 31, 2023

Not nearly enough, we need support for 1-bit signed floats.

Interpause commented on Apr 2, 2023

> Not nearly enough, we need support for 1-bit signed floats.

Swap that out for 1 qubit and now we're talking.

prusnak (Collaborator) commented on Apr 2, 2023

> Not nearly enough, we need support for 1-bit signed floats.

I think the best model size and performance will be achieved when 0-bit quantization is used.

Lolagatorade commented on Apr 12, 2023

> Not nearly enough, we need support for 1-bit signed floats.

> I think the best model size and performance will be achieved when 0-bit quantization is used.

Mhmm possibly -1...


Linked a pull request that will close this issue on Apr 16, 2023: Q2 and Q3 quantization #1004

ggerganov (Member, Author) commented on Jun 24, 2023

Thanks to K-quants, this is now available.

MrMage commented on Jun 26, 2023

Have there been any new insights into the quality of 2-bit quantization? I.e., does that approach produce reasonable results now?

Green-Sky (Collaborator) commented on Jun 26, 2023

@MrMage Pure Q2 will never be good, but the k-quants use a mixture with some Q2 to achieve reasonable results. Check out how LLAMA_FTYPE_MOSTLY_Q2_K is composed in #1684.

Added a commit that references this issue on Dec 19, 2023: Merge pull request ggml-org#456 from AgentJ-WR/patch-1 (236c4cf)