As discussed in #1602, k-quants do not work for the Falcon-7B model. This is due to the fact that the number of columns in many tensors (4544) is not divisible by 256, which is the super-block size of the k-quants.
It would be useful if k-quants could be adapted to work in such cases.
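For concreteness, the incompatibility is just a divisibility check on the row length; a minimal illustration (the super-block size of 64 mentioned later in the thread is included for comparison):

```cpp
#include <cstdio>

int main() {
    // Falcon-7B has 4544 columns in many tensors; k-quant super-blocks hold 256 values.
    const int n_cols = 4544;
    printf("4544 %% 256 = %d\n", n_cols % 256); // 192 -> rows do not split into whole super-blocks
    printf("4544 %%  64 = %d\n", n_cols % 64);  //   0 -> super-blocks of 64 would fit exactly
    return 0;
}
```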
Activity
SlyEcho commented on Jun 18, 2023
Same for OpenLLaMA 3B.
debackerl commented on Jun 19, 2023
Does it make sense to improve older quantization algorithms if they are not state of the art? SqueezeLLM has just been released, beating GPTQ on perplexity and speed, and they also released their CUDA kernels.
https://github.com/SqueezeAILab/SqueezeLLM
Just my 2 cents :-)
ikawrakow commented on Jun 19, 2023
My view is that k-quants are SOTA. The discussion in #1595 and the description of #1684 shed some light on why I think k-quants are SOTA. Perhaps you could elaborate on how you arrived at the conclusion that the SqueezeLLM approach is better?
KerfuffleV2 commented on Jun 19, 2023
This might be a crazy idea, but what if the remainder not divisible by QK_K was just left as f32 or f16 and not quantized at all?
It might need some special logic to handle, but it would only apply to the very last partial "block", and it should be possible to calculate if that's necessary outside of any loops.
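For illustration only (not ggml code), the split this idea implies can be computed per row like so, which also shows why the check can happen once, outside the inner loops:

```cpp
#include <cstdio>

// Sketch of the "leave the remainder unquantized" idea: quantize whole QK_K-sized
// super-blocks and keep only the final partial block as f16/f32. Illustrative only.
constexpr int QK_K = 256;

int main() {
    const int n_cols = 4544;                    // e.g. a Falcon-7B row
    const int n_full = (n_cols / QK_K) * QK_K;  // values covered by whole super-blocks
    const int n_tail = n_cols - n_full;         // remainder that would stay unquantized
    // n_tail is the same for every row of the tensor, so whether the special-case
    // path is needed at all can be decided once, before any per-row loops run.
    printf("quantized values: %d, unquantized tail: %d\n", n_full, n_tail);
    return 0;
}
```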
ikawrakow commented on Jun 19, 2023
I'm on it. The current thinking is that I will add padding such that I have a multiple of 256. When quantizing, the values that are beyond the actual tensor size will be assumed to be zero. When de-quantizing, one needs to take care to not dequantize values beyond the actual tensor size. Same applies to dot products.
It is a pretty big change, so it will take some time.
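A rough sketch of what that per-row padding could look like; the names and layout here are illustrative, not the eventual llama.cpp/ggml implementation:

```cpp
#include <cstring>
#include <vector>

// Illustrative sketch of the padding approach: round each row up to a multiple of
// 256, treat the padded values as zeros when quantizing, and make sure
// de-quantization and dot products stop at the real row length.
constexpr int QK_K = 256;

std::vector<float> pad_row_for_kquants(const float * row, int n_cols) {
    const int n_padded = ((n_cols + QK_K - 1) / QK_K) * QK_K; // round up to a multiple of 256
    std::vector<float> padded(n_padded, 0.0f);                // everything past n_cols stays zero
    std::memcpy(padded.data(), row, n_cols * sizeof(float));
    return padded; // this buffer can be fed to the unchanged super-block quantization
}
```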
The alternative that was proposed somewhere above is to use super-blocks of 64. This will work for Falcon-7b and for OpenLLaMA 3B. It is a potentially smaller change, but using super-blocks of 64 almost defeats the purpose of the super-blocks, which is to save bits by using quantized scales for the blocks inside a super-block. To give a specific example, with a super-block of 256 and Q4_K, we have 8 blocks of 32, each having a scale and a min of 6 bits, so that's 8 * 12 = 96 bits. We then have the fp16 scale and min of the super-block, which is another 32 bits, for a total of 128 bits per super-block, or 0.5 bits of extra data per weight. For a super-block size of 64 we have 2 * 12 + 32 = 56 bits per super-block, or 0.875 bits per weight. That's almost the same as Q4_1 (1 extra bit per weight), so we might as well add to Q4_1 the rmse+cosine distance minimization that is used in Q4_K while quantizing and just use a modified version of Q4_1.
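Written out, the bit accounting from the paragraph above looks like this (numbers taken directly from the comment):

```cpp
#include <cstdio>

// Overhead per super-block for Q4_K-style metadata: each block of 32 weights has a
// 6-bit scale and a 6-bit min, and the super-block adds an fp16 scale and min (32 bits).
int main() {
    const int super_block_sizes[] = { 256, 64 };
    for (int sb : super_block_sizes) {
        const int n_blocks   = sb / 32;            // blocks of 32 weights inside the super-block
        const int extra_bits = n_blocks * 12 + 32; // 8*12+32 = 128 for 256, 2*12+32 = 56 for 64
        printf("super-block %3d: %3d extra bits -> %.3f bits per weight\n",
               sb, extra_bits, (double)extra_bits / sb);
    }
    return 0;
}
```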
SlyEcho commented on Jun 19, 2023
Padding per row should be possible; after all, we store the row length and its size in bytes separately.
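Presumably this refers to ggml storing a tensor's element counts and byte strides separately (its ne and nb fields); a simplified sketch of how that leaves room for per-row padding:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified sketch (not the real ggml_tensor): because the logical row length and
// the byte stride between rows are stored separately, a row can occupy more bytes
// than its logical length strictly needs, which is what makes per-row padding possible.
struct TensorView {
    int64_t ne0; // logical number of values per row, e.g. 4544
    size_t  nb1; // bytes from the start of one row to the start of the next
};

// With padding to whole 256-value super-blocks, nb1 would be derived from the padded
// length, while ne0 keeps the true row size:
size_t padded_row_stride(int64_t ne0, size_t bytes_per_superblock) {
    const int64_t n_superblocks = (ne0 + 255) / 256; // round up to whole super-blocks
    return (size_t)n_superblocks * bytes_per_superblock;
}
```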
TheBloke commented on Jun 20, 2023
Just to add, FYI, that I just learned of another type of model that's affected: certain Llama models based on OpenAssistant, which have a vocab size of 32016.
Example model exhibiting this: https://huggingface.co/MetaIX/GPT4-X-Alpasta-30b
convert.py:
Writing vocab...
[ 1/543] Writing tensor tok_embeddings.weight | size 32016 x 6656 | type UnquantizedDataType(name='F16')
[ 2/543] Writing tensor norm.weight | size 6656 | type UnquantizedDataType(name='F32')
[ 3/543] Writing tensor output.weight | size 32016 x 6656 | type UnquantizedDataType(name='F16')
[ 4/543] Writing tensor layers.0.attention.wq.weight | size 6656 x 6656 | type UnquantizedDataType(name='F16')
[ 5/543] Writing tensor layers.0.attention.wk.weight | size 6656 x 6656 | type UnquantizedDataType(name='F16')
...
quantize:
llama.cpp: loading model from /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.fp16.bin
llama.cpp: saving model to /workspace/process/alpasta-30b/ggml/alpasta-30b.ggmlv3.q2_K.bin
========================= Tensor sizes 6656 x 32016 are not divisible by 256
This is required to be able to use k-quants for now!
========================================================================================
Out of interest, did something change with regards to this in the last week or two? Because 11 days ago I quantised OpenAssistant-SFT-7, which also uses 32016 x 6656, and it quantised fine: https://huggingface.co/TheBloke/OpenAssistant-SFT-7-Llama-30B-GGML/tree/main
KerfuffleV2 commented on Jun 20, 2023
A check to make sure the sizes were compatible with k-quants was added (and to fail if they're not). Before that, parts of the tensors might have been corrupted, or it could possibly cause GGML to read/write memory out of bounds.
So even if it might have seemed like it was working, there were probably issues, and unfortunately it couldn't be left in the current state.
TheBloke commented on Jun 20, 2023
Ah I see! I guess I should delete those k-quants from the OpenAssistant-based repos then.
Thanks for the details.
ymcui commented on Jun 26, 2023
Greetings from the Chinese-LLaMA-Alpaca project.
I would like to report that after PR #1921 was merged into the main branch, our models can no longer be quantized with the k-quants series, while they were functional before that PR.
The reason is that the vocabulary size of our model is not divisible by 256. For example, our Chinese Alpaca model has a vocabulary size of 49954. As the k-quants series generally has better performance, it is really a pity that we can no longer use this feature. In particular, for larger models (like 33B or 65B), q3_k or lower quantization methods are exceptionally useful, as q4_0 or q5_0 won't fit into RAM for most people.
Looking forward to some workaround for this in the future.
ikawrakow commented on Jun 26, 2023
@ymcui
Sorry about that. If the model worked for you before #1921, the solution is to change the check at https://github.com/ggerganov/llama.cpp/blob/447ccbe8c39332fcdd0d98a041b6e2ff6f06219d/llama.cpp#L2510 so that it no longer rejects these tensor shapes.
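Roughly, the idea is to loosen that divisibility test; a hedged sketch, with nx/ny standing in for the tensor's row length and row count and QK_K for the super-block size (names assumed for illustration, not copied from llama.cpp):

```cpp
// Hedged sketch only -- not the actual code at the linked line. The check added in
// #1921 rejects a tensor when either dimension is not a multiple of QK_K; since ggml
// quantizes row by row, a model whose only mismatched dimension is the vocabulary
// size (the row count) could in principle be let through by testing just the row length.
constexpr int QK_K = 256;

bool kquants_shape_ok_strict(int nx, int ny) {
    return nx % QK_K == 0 && ny % QK_K == 0; // both dimensions must fit whole super-blocks
}

bool kquants_shape_ok_relaxed(int nx, int /*ny*/) {
    return nx % QK_K == 0;                   // only the row length has to fit whole super-blocks
}
// Example: a 6656 x 49954 embedding passes the relaxed check (6656 % 256 == 0),
// while Falcon-7B rows of length 4544 still fail either way.
```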
Basically, some people wasted a lot of time trying to figure out why their models weren't working with k-quants. To prevent this while a solution is being worked on, I added this check in #1921. The check is in many cases more restrictive than it needs to be, but I wanted to be certain that nobody wastes their time like that again.