
Make k-quants work with tensor dimensions that are not multiple of 256 #1919

Closed
@ikawrakow

Description


As discussed in #1602, k-quants do not work for the Falcon-7B model. This is because the number of columns in many tensors (4544) is not divisible by 256, which is the super-block size of the k-quants.

It would be useful if k-quants could be adapted to work in such cases.

Activity

SlyEcho (Collaborator) commented on Jun 18, 2023

Same for OpenLLaMA 3B.

debackerl commented on Jun 19, 2023

Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.

https://github.com/SqueezeAILab/SqueezeLLM

Just my 2 cents :-)

ikawrakow (Contributor, Author) commented on Jun 19, 2023

> Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.
>
> https://github.com/SqueezeAILab/SqueezeLLM
>
> Just my 2 cents :-)

My take is that k-quants are SOTA. The discussion in #1595 and the description of #1684 shed some light on why I think k-quants are SOTA. Perhaps you could elaborate on how you arrived at the conclusion that the SqueezeLLM approach is better?

KerfuffleV2 (Collaborator) commented on Jun 19, 2023

This might be a crazy idea, but what if the remainder that is not divisible by QK_K were just left as f32 or f16 and not quantized at all?

It might need some special logic to handle, but it would only apply to the very last partial "block", and whether that's necessary could be determined outside of any loops.
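A minimal sketch of that idea, assuming a hypothetical per-super-block helper (the real row-quantization functions in k_quants.c have different signatures, and this is not how the project ended up solving it):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK_K 256

// Hypothetical: quantize one full super-block of QK_K floats into `blk`.
void quantize_superblock_q4_K(const float * x, void * blk);

// Sketch: quantize only the full super-blocks of a row and keep the
// trailing remainder (n % QK_K values) as raw fp32 appended after them.
size_t quantize_row_with_f32_tail(const float * x, uint8_t * dst,
                                  size_t blk_size, int n) {
    const int n_full = (n / QK_K) * QK_K;   // elements covered by full super-blocks
    const int n_tail = n - n_full;          // leftover elements (< QK_K)

    size_t offset = 0;
    for (int i = 0; i < n_full; i += QK_K) {
        quantize_superblock_q4_K(x + i, dst + offset);
        offset += blk_size;
    }
    // Store the partial "block" unquantized; the dequantize and dot-product
    // code would need a matching special case for this tail.
    memcpy(dst + offset, x + n_full, (size_t) n_tail * sizeof(float));
    offset += (size_t) n_tail * sizeof(float);
    return offset;   // bytes written for this row
}
```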

ikawrakow (Contributor, Author) commented on Jun 19, 2023

I'm on it. The current thinking is that I will add padding such that I have a multiple of 256. When quantizing, the values that are beyond the actual tensor size will be assumed to be zero. When de-quantizing, one needs to take care to not dequantize values beyond the actual tensor size. Same applies to dot products.

It is a pretty big change, so it will take some time.

The alternative that was proposed somewhere above is to use super-blocks of 64. This will work for Falcon-7B and for OpenLLaMA 3B. It is a potentially smaller change, but using super-blocks of 64 almost defeats the purpose of the super-blocks, which is to save bits by using quantized scales for the blocks inside a super-block.

To give a specific example: with a super-block of 256 and Q4_K, we have 8 blocks of 32, each having a 6-bit scale and a 6-bit min, so that's 8 * 12 = 96 bits. We then have the fp16 scale and min of the super-block, which is another 32 bits, for a total of 128 bits per super-block, or 0.5 bits of extra data per weight. For a super-block size of 64 we have 2 * 12 + 32 = 56 bits per super-block, or 0.875 bits per weight. That's almost the same as Q4_1 (1 extra bit per weight), so we might as well add to Q4_1 the rmse + cosine-distance minimization that is used in Q4_K while quantizing, and just use a modified version of Q4_1.
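Conceptually, the padding approach could look like the following sketch (hypothetical helper names; the actual change has to touch the quantization, dequantization, and dot-product kernels):

```c
#include <stddef.h>
#include <string.h>

#define QK_K 256

// Hypothetical per-super-block routines standing in for the real k-quant kernels.
void quantize_superblock(const float * x, void * blk);
void dequantize_superblock(const void * blk, float * y);

// Quantize a row of n values, treating everything past n as zero padding.
void quantize_row_padded(const float * x, void * blks, size_t blk_size, int n) {
    float padded[QK_K];
    const int n_blocks = (n + QK_K - 1) / QK_K;   // round up to whole super-blocks
    for (int ib = 0; ib < n_blocks; ++ib) {
        const int start = ib * QK_K;
        const int count = (n - start) < QK_K ? (n - start) : QK_K;
        memset(padded, 0, sizeof(padded));                        // pad with zeros
        memcpy(padded, x + start, (size_t) count * sizeof(float));
        quantize_superblock(padded, (char *) blks + (size_t) ib * blk_size);
    }
}

// Dequantize, taking care not to write values beyond the actual row length n.
void dequantize_row_padded(const void * blks, size_t blk_size, float * y, int n) {
    float tmp[QK_K];
    const int n_blocks = (n + QK_K - 1) / QK_K;
    for (int ib = 0; ib < n_blocks; ++ib) {
        const int start = ib * QK_K;
        const int count = (n - start) < QK_K ? (n - start) : QK_K;
        dequantize_superblock((const char *) blks + (size_t) ib * blk_size, tmp);
        memcpy(y + start, tmp, (size_t) count * sizeof(float));   // drop the padded tail
    }
}
```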

SlyEcho (Collaborator) commented on Jun 19, 2023

Padding per row should be possible; after all, we store the row length and its size in bytes separately.
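For context, ggml tensors keep the element counts (ne) and byte strides (nb) as separate fields, so a padded row stride could in principle be computed like this (illustrative only, not an existing ggml function):

```c
#include <stddef.h>
#include <stdint.h>

#define QK_K 256

// ggml stores per-dimension element counts (ne[]) and per-dimension byte
// strides (nb[]) separately, so the byte stride of a row does not have to
// equal ne[0] * element_size. Illustrative: row stride rounded up to whole
// super-blocks.
size_t padded_row_size(int64_t ne0, size_t superblock_size_bytes) {
    const int64_t n_blocks = (ne0 + QK_K - 1) / QK_K;   // round up
    return (size_t) n_blocks * superblock_size_bytes;   // bytes per padded row
}
```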

TheBloke (Contributor) commented on Jun 20, 2023

Just to add, FYI: I just learned of another type of model that's affected: certain Llama models based on OpenAssistant, which have a vocab size of 32016.

Example model exhibiting this: https://huggingface.co/MetaIX/GPT4-X-Alpasta-30b

convert.py:
Writing vocab...
[  1/543] Writing tensor tok_embeddings.weight                  | size  32016 x   6656  | type UnquantizedDataType(name='F16')
[  2/543] Writing tensor norm.weight                            | size   6656           | type UnquantizedDataType(name='F32')
[  3/543] Writing tensor output.weight                          | size  32016 x   6656  | type UnquantizedDataType(name='F16')
[  4/543] Writing tensor layers.0.attention.wq.weight           | size   6656 x   6656  | type UnquantizedDataType(name='F16')
[  5/543] Writing tensor layers.0.attention.wk.weight           | size   6656 x   6656  | type UnquantizedDataType(name='F16')
...
quantize:
llama.cpp: loading model from /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.fp16.bin
llama.cpp: saving model to /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.q2_K.bin
========================= Tensor sizes 6656 x 32016 are not divisible by 256
This is required to be able to use k-quants for now!
========================================================================================

Out of interest, did something change with regards to this in the last week or two? Because 11 days ago I quantised OpenAssistant-SFT-7 which also uses 32016 x 6656 and it quantised fine: https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-GGML/tree/main


KerfuffleV2 (Collaborator) commented on Jun 20, 2023

> Out of interest, did something change with regards to this in the last week or two?

A check was added to make sure the tensor sizes are compatible with k-quants (and to fail if they're not). Before that, parts of the tensors might have been corrupted, or GGML might have read/written memory out of bounds.

So even if it might have seemed like it was working, there were probably issues, and unfortunately it couldn't be left in that state.

TheBloke (Contributor) commented on Jun 20, 2023

Ah I see! I guess I should delete those k-quants from the OpenAssistant-based repos then.

Thanks for the details.

ymcui (Contributor) commented on Jun 26, 2023

Greetings from the Chinese-LLaMA-Alpaca project.

I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, although they worked before that PR.

The reason is that the vocabulary size of our model is not divisible by 256. For example, our Chinese Alpaca model has a vocabulary size of 49954. As the k-quants series generally has better performance, it is really a pity that we can no longer use this feature. Especially for larger models (like 33B or 65B), q3_k or lower quantization methods are exceptionally useful, as q4_0 or q5_0 won't fit into RAM for most people.

Looking forward to some workaround for this in the future.

ikawrakow (Contributor, Author) commented on Jun 26, 2023

@ymcui

> I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, although they worked before that PR.

Sorry about that. If the model worked for you before #1921, the solution is to change https://github.com/ggerganov/llama.cpp/blob/447ccbe8c39332fcdd0d98a041b6e2ff6f06219d/llama.cpp#L2510
to

if (nx % QK_K != 0) {

Basically, some people wasted a lot of time trying to figure out why their models weren't working with k-quants. To prevent this while a solution is being worked on, I added this check in #1921. The check is in many cases more restrictive than it needs to be, but I wanted to be certain that nobody wastes their time like that again.
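For reference, after the suggested relaxation the check might read roughly as follows. This is a sketch, not the verbatim llama.cpp code: the error text is taken from the quantize output quoted above, and the assumption is that the original condition also rejected tensors with ny % QK_K != 0.

```c
// nx = first tensor dimension (row length), ny = second dimension (number of rows).
// Assumed original condition: (nx % QK_K != 0 || ny % QK_K != 0); relaxing it to
// only test nx lets models through whose only "odd" dimension is the vocab size.
if (nx % QK_K != 0) {
    fprintf(stderr, "\n\n========================= Tensor sizes %d x %d are not divisible by %d\n",
            nx, ny, QK_K);
    fprintf(stderr, "This is required to be able to use k-quants for now!\n");
    fprintf(stderr, "========================================================================================\n\n");
    // ... abort quantization here (error handling omitted in this sketch)
}
```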


Metadata

Labels: enhancement (New feature or request), model (Model specific)