Description
Following on from discussions in the Llama 2 70B PR, #2276:
Since that PR, converting Llama 2 70B models from Meta's original PTH format files works great.
But it is not possible to make usable Llama 2 70B models from HF format. The models convert and quantise fine, but always produce gibberish, as in this example:
### Human: write a story about llamas\n### Assistant:20 300202000 B00A0
It looks like the tensors get transformed by the new permute, which uses the GQA parameters num_local_key_value_heads and num_key_value_heads:
https://github.com/huggingface/transformers/blob/b257c46a075419c09e5ce5c5aa39bc346ecdb9a5/src/transformers/models/llama/convert_llama_weights_to_hf.py#L173-L195
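The permute there looks roughly like this (paraphrased from the linked lines, so the exact signature may differ between Transformers versions):

```python
import torch

# Paraphrase of the rotary permutation in convert_llama_weights_to_hf.py (not verbatim).
# For GQA models the K projection is permuted with n_heads=num_key_value_heads and a
# smaller dim1, which is the part the llama.cpp side does not yet undo.
def hf_permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)
```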
For reference, here are all the changes that happened in Transformers' convert_llama_weights_to_hf.py
for the Llama 2 release: huggingface/transformers@07360b6#diff-110a445233a8b15a0875998eeaf75cb8607b38a5daa736291dd058766879bbdd
Would anyone be able to look into this? It's a bit beyond my experience.
I'm getting multiple requests a day for 70B fine tune quants for FreeWilly 2, Llama2-Guanaco, and the newly released Airoboros 1.4.1 70B, and would love to be able to provide them for people.
Thanks in advance.
Activity
klosax commented on Jul 24, 2023
It would indeed be very nice to be able to convert the 70b HF models.
I tried to look into it and figure out how it all worked but the needed skills are beyond me.
I think the permute() function in the Transformers conversion script is getting reversed by the permute() in llama.cpp's convert.py, but that function is not yet compatible with Llama v2, which uses GQA.
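For context, the permute() in convert.py only takes the head count and knows nothing about GQA at the moment; it is roughly this (paraphrased, not verbatim):

```python
import numpy as np

# Rough paraphrase of convert.py's current permute() (not verbatim): it only receives
# n_head, so it cannot correctly undo a K projection that the HF script permuted with
# the smaller num_key_value_heads.
def permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))
```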
KerfuffleV2 commented on Jul 26, 2023
Have you tried just temporarily disabling permute() in the llama.cpp convert.py? Basically just changing it so it returns the weights unmodified.
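For instance, something like this (just a sketch of the experiment, not a proper fix):

```python
import numpy as np

# Sketch only: neuter permute() so the HF tensors are written out unchanged,
# purely to see whether the extra permutation is what breaks the 70B output.
def permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    return weights
```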
It may not work (or some kind of reshape might still be needed) but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16bit 70B at hand.
klosax commented on Jul 26, 2023
No, I don't think that would work, since all pth models, including the 70B, are transformed by the HF permute(). I guess we need a new permute() in convert.py to reverse it.
7erminalVelociraptor commented on Jul 26, 2023
I can try it out later today; I have the Airoboros 70B finetune in HF format on my desktop, and I think the base Llama 2 as well.
MrJackSpade commented on Jul 26, 2023
What's the level of effort on porting the permutation changes?
Does it require a lot of knowledge, or is it a straight transposition?
klosax commented on Jul 26, 2023
Here is the new HF conversion script that converts the original pth models to HF format:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
This is the old HF conversion script:
https://github.com/huggingface/transformers/blob/feb83521eca849731573dd40da89a02e4f370e5a/src/transformers/models/llama/convert_llama_weights_to_hf.py
Now, llama.cpp wants the tensors in the pth layout, so any transformations made to the HF tensors need to be reversed in convert.py.
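As far as I can tell from the linked script, that transformation only touches the per-layer Q and K projection weights, so those are the ones that need un-permuting; roughly:

```python
# Tensors that the HF conversion script permutes (per transformer layer), based on the
# linked script; everything else is copied over unchanged, so only these two need the
# reverse permute in convert.py:
PERMUTED_BY_HF_SCRIPT = [
    "model.layers.{i}.self_attn.q_proj.weight",  # permuted with n_heads
    "model.layers.{i}.self_attn.k_proj.weight",  # permuted with num_key_value_heads on GQA models
]
```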
mj-shifu commented on Jul 27, 2023
Hello,
I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more:
https://github.com/mj-shifu/llama.cpp/blob/e15a67d6b21c10326a5cc74ab6d6ce9b8d7702bb/convert.py
I am not sure if it is correct but the converted Huggingface model produces exactly the same outputs as the converted pth model when sampling is disabled.
@TheBloke Could you possibly try that out?
I'm sorry for any mistakes, this is my first time contributing here.
TheBloke commented on Jul 27, 2023
Wonderful, thank you! I am having dinner now but will check as soon as I am at my PC.
KerfuffleV2 commented on Jul 27, 2023
Nice, I was working on this as well, and what you did is very close to my approach. So, theoretically, if I'm not an idiot, that is a good sign.
I'd suggest making the type for n_kv_head be Optional[int], since it gets set to None. Also, you then don't even need a conditional to populate it; you can just do n_kv_head = config.get("num_key_value_heads").
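Something along these lines (only a sketch of the suggestion; the helper name here is made up):

```python
from typing import Optional

def read_n_kv_head(config: dict) -> Optional[int]:
    # dict.get() already returns None when num_key_value_heads is absent,
    # so no separate conditional is needed for configs without the key.
    return config.get("num_key_value_heads")

# e.g. a Llama 2 70B config.json gives 8; a config lacking the key gives None.
```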
mj-shifu commented on Jul 27, 2023
Thanks for the suggestions. I just updated the code accordingly. Although I also think that the similarities are a good sign, I'm sorry that I caused duplicated work.
https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py
KerfuffleV2 commented on Jul 27, 2023
Absolutely no need to apologize. I'm happy to see someone else got it done! Assuming it works, you should make a pull with these changes! I can't test whether it actually works, but for whatever my opinion is worth the code looks very reasonable.
If you wanted to reduce duplication a bit you could try something like:
mj-shifu commented on Jul 27, 2023
Thank you! I should probably wait for @TheBloke's result before I make a pull request, shouldn't I? I have only tested it with a Huggingface model that I converted locally from the 70B pth model using the official Huggingface script.
klosax commented on Jul 27, 2023
If n_kv_head is used, n_head should not be divided by n_kv_head in the third parameter.
Something like this looks correct to me:
KerfuffleV2 commented on Jul 27, 2023
I don't think either approach is wrong, so if you're more comfortable with that then it's perfectly fine.
Pull requests have to get approved by someone before they're merged, so as long as you added a note that it was still being tested there wouldn't be a danger of it instantly getting merged with problems. It would also be possible to create a draft pull request that can't be merged until you set it ready for review.
It's generally easier to discuss actual code changes when they're in a pull request so that's why if it was me I'd probably just go ahead and create a pull (even a draft one).
I finally finished downloading an HF version ( https://huggingface.co/stabilityai/StableBeluga2 - previously known as FreeWilly2; they just randomly renamed it recently ) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later though. :)
mj-shifu commented on Jul 27, 2023
@klosax Yes, but I think you can even remove the dim1, because // (n_head // n_kv_head) becomes // n_head * n_kv_head, which is what we want.
@KerfuffleV2 That's very cool! I'll make a pull request.
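(As a quick sanity check of that equality with the 70B shapes, hidden size 8192 with 64 heads and 8 KV heads:)

```python
# The two divisions agree whenever the row count is a multiple of n_head,
# which it is for these attention weights.
rows, n_head, n_kv_head = 8192, 64, 8  # Llama 2 70B attention shapes
assert rows // (n_head // n_kv_head) == rows // n_head * n_kv_head == 1024
```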
mkroman commented on Jul 27, 2023
I'd wager the renaming was due to trademark issues.
I've converted it with https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py and quantized it for Q4_K_S and have uploaded it.
It is available here (for an unknown amount of time, but at least for as long as this PR):
https://storage.labs.rwx.im/llm/stable-beluga-2/stable-beluga-2-ggml/stable-beluga-2-q4_k_s-ggml.bin
It'll be a bit before I can test it myself, but feel free to try the link if it's faster. It's ~36.2 GiB.
SHASUMs:
mj-shifu commented on Jul 27, 2023
@mkroman Your converted model works with GGML!
mkroman commented on Jul 27, 2023
Yeah, just confirmed it myself. It looks very promising - thanks for the patch :)
Output
TheBloke commented on Jul 27, 2023
Wonderful work, @mj-shifu! My Stable Beluga 2 GGMLs are uploading, and I will soon do Airoboros 70B and Guanaco 70B.
Thanks so much for getting this done. Amazing first contribution! :)