This repository was archived by the owner on Jun 24, 2024. It is now read-only.

Add support for new k-quants quantization format #301

Closed
LLukas22 opened this issue Jun 7, 2023 · 7 comments · Fixed by #326
Labels
issue:enhancement New feature or request meta:help-wanted Extra attention is needed

Comments

@LLukas22
Contributor

LLukas22 commented Jun 7, 2023

llama.cpp now supports the new k-quants quantization formats, which achieve good model perplexity even at aggressive quantization levels. See ggml-org/llama.cpp#1684.

We should also support these new quantization formats.

@KerfuffleV2
Contributor

Is this actually not supported yet?

Either way, probably should keep this in mind: ggml-org/llama.cpp#1919

The solution there was to add checks in the quantization tool to prevent trying to quantize those tensors (or models) with k-quants, but a project using GGML at a lower level would need to verify that the tensor rows and columns are divisible by QK_K.
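
To make the constraint concrete, here is a minimal Rust sketch of that kind of check. The function name and the choice to test both dimensions are only illustrative, and QK_K is hardcoded to ggml's default of 256:

```rust
/// Illustrative only: k-quants pack values into super-blocks of QK_K elements,
/// so (per the comment above) a tensor's dimensions need to be divisible by
/// QK_K before a k-quant type can be used for that tensor.
const QK_K: usize = 256; // upstream ggml's default super-block size

fn tensor_supports_k_quants(n_rows: usize, n_cols: usize) -> bool {
    n_rows % QK_K == 0 && n_cols % QK_K == 0
}

fn main() {
    // A 4096 x 4096 weight matrix is fine (4096 = 16 * 256)...
    assert!(tensor_supports_k_quants(4096, 4096));
    // ...but a tensor with a 1000-wide dimension is not.
    assert!(!tensor_supports_k_quants(4096, 1000));
}
```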

@nightscape
Contributor

Is someone already working on this?
I'm a Rust noob and would probably have to chicken out when it gets complicated, but I could at least get it started 😃
I would start by taking all the #ifdef GGML_USE_K_QUANTS sections from llama.cpp, having GPT-4 convert them to Rust, and trying to fit them into the llm codebase.

@KerfuffleV2
Contributor

I think the first thing you'd need to do is check whether the llama.cpp binding that llm depends on is compiled with k-quants. If it is, you probably don't have to do anything more than add the k-quants types to the enums where quantization types are currently listed. In the case of the quantization tool (and maybe model loading as well) you'd have to check that the tensor rows/columns are divisible by QK_K (you can possibly just hardcode 256 there).

As long as the binding part is against a new enough version of GGML and it's getting compiled with k-quants, this might be a simple change.
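
As a rough illustration of the "just add the types to the enums" part, something like the sketch below. The enum and variant names are hypothetical stand-ins for whatever llm actually defines, and the block sizes should be double-checked against ggml.h:

```rust
/// Hypothetical sketch of extending a quantization-format enum with the
/// k-quant variants. Names and the mapping to ggml's type ids must match
/// the ggml revision the crate is built against.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantizationType {
    // existing formats (abridged)
    Q4_0,
    Q4_1,
    Q8_0,
    // new k-quant formats
    Q2_K,
    Q3_K,
    Q4_K,
    Q5_K,
    Q6_K,
}

impl QuantizationType {
    /// Block size a tensor row must be divisible by for this format.
    fn block_size(self) -> usize {
        match self {
            QuantizationType::Q4_0
            | QuantizationType::Q4_1
            | QuantizationType::Q8_0 => 32,
            // all k-quant formats use QK_K = 256 super-blocks
            _ => 256,
        }
    }
}

fn main() {
    assert_eq!(QuantizationType::Q4_K.block_size(), 256);
}
```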

@nightscape
Contributor

@KerfuffleV2 thanks for your input. I might be mistaken/misunderstanding, but I think llm only depends on ggml.c and reimplements the file loading from llama.cpp.
At least I'm finding places where the quantization types are listed as enums or constants.
@LLukas22 @philpax could you enlighten us?

@KerfuffleV2
Contributor

> I might be mistaken/misunderstanding, but I think llm only depends on ggml.c

Sorry, I might have been incorrect there. I thought I saw a pull request a while back saying something like it now depends on llama.cpp.

You could possibly look at the approach I use here: https://github.com/KerfuffleV2/ggml-sys-bleedingedge - well, actually two approaches. One is to build the old-style way and include the k-quants stuff unless it's disabled by a feature. The other is to use cmake to build, which makes things like compiling with CUDA simpler (although it doesn't help with determining how to link). For the latter approach, I was able to get a ggml_static target added, so you actually don't have to link against llama.cpp when you only need ggml (but it will require linking with C++ because of the BLAS stuff).
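
A build.rs sketch of the "old style" approach, assuming the ggml sources are vendored under a ggml/ directory and a hypothetical no-k-quants Cargo feature for opting out (file names and defines need to match the ggml revision actually vendored):

```rust
// build.rs (sketch, not the actual llm build script)
fn main() {
    let mut build = cc::Build::new();
    build.file("ggml/ggml.c").include("ggml");

    // Compile the k-quants kernels unless the (hypothetical) opt-out feature
    // is enabled, mirroring llama.cpp's GGML_USE_K_QUANTS switch.
    if std::env::var("CARGO_FEATURE_NO_K_QUANTS").is_err() {
        build.file("ggml/k_quants.c");
        build.define("GGML_USE_K_QUANTS", None);
    }

    build.compile("ggml");
}
```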

@LLukas22
Contributor Author

@nightscape If we want to support the k-quants we probably have to wrap k_quants.h in a similar way to ggml.h (see here), and we probably have to compile ggml with k_quants enabled in the build.rs file. If that's implemented, we have to extend our enums and extend the quantization logic to call the correct wrapped functions.
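
For the wrapping part, a sketch of what hand-written bindings could look like. The prototype is modeled on the existing ggml_quantize_* declarations and must be verified against the actual k_quants.h; the safe wrapper name and histogram size are made up for this example:

```rust
// Sketch of a hand-written FFI declaration for one k-quant quantization entry
// point. The exact signature must be checked against k_quants.h before use.
use std::os::raw::{c_int, c_void};

extern "C" {
    // Assumed prototype: quantize `n` floats from `src` into `dst`,
    // update the histogram, and return the number of bytes written.
    fn ggml_quantize_q4_K(
        src: *const f32,
        dst: *mut c_void,
        n: c_int,
        k: c_int,
        hist: *mut i64,
    ) -> usize;
}

/// Illustrative safe wrapper the quantization logic could call once the
/// quantization-type enum knows about the new format.
pub fn quantize_q4_k(src: &[f32], dst: &mut [u8], hist: &mut [i64; 16]) -> usize {
    unsafe {
        ggml_quantize_q4_K(
            src.as_ptr(),
            dst.as_mut_ptr() as *mut c_void,
            src.len() as c_int,
            src.len() as c_int,
            hist.as_mut_ptr(),
        )
    }
}
```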

nightscape mentioned this issue Jun 22, 2023
@nightscape
Contributor

See #326
