Description
Following on from discussions in the Llama 2 70B PR, #2276:
Since that PR, converting Llama 2 70B models from Meta's original PTH format files works great.
But it is not possible to make usable Llama 2 70B models from HF format. The models convert and quantise fine, but always produce gibberish, as in this example:
### Human: write a story about llamas\n### Assistant:20 300202000 B00A0
It looks like the tensors get transformed by the new permute, which uses the GQA parameters num_local_key_value_heads and num_key_value_heads:
https://github.com/huggingface/transformers/blob/b257c46a075419c09e5ce5c5aa39bc346ecdb9a5/src/transformers/models/llama/convert_llama_weights_to_hf.py#L173-L195
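The permute there looks roughly like this (paraphrased from the linked lines, so the exact signature may differ between Transformers versions):

```python
import torch

# Paraphrase of the rotary permutation in convert_llama_weights_to_hf.py (not verbatim).
# For GQA models the K projection is permuted with n_heads=num_key_value_heads and a
# smaller dim1, which is the part the llama.cpp side does not yet undo.
def hf_permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)
```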
For reference, here are all the changes that happened in Transformers' convert_llama_weights_to_hf.py
for the Llama 2 release: huggingface/transformers@07360b6#diff-110a445233a8b15a0875998eeaf75cb8607b38a5daa736291dd058766879bbdd
Would anyone be able to look into this? It's a bit beyond my experience.
I'm getting multiple requests a day for 70B fine tune quants for FreeWilly 2, Llama2-Guanaco, and the newly released Airoboros 1.4.1 70B, and would love to be able to provide them for people.
Thanks in advance.
Activity
klosax commented on Jul 24, 2023
It would indeed be very nice to be able to convert the 70b HF models.
I tried to look into it and figure out how it all worked but the needed skills are beyond me.
I think the permute() function in the Transformers conversion script is getting reversed by the permute() in llama.cpp's convert.py, but that function is not yet compatible with Llama v2, which uses GQA.
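For context, the permute() in convert.py only takes the head count and knows nothing about GQA at the moment; it is roughly this (paraphrased, not verbatim):

```python
import numpy as np

# Rough paraphrase of convert.py's current permute() (not verbatim): it only receives
# n_head, so it cannot correctly undo a K projection that the HF script permuted with
# the smaller num_key_value_heads.
def permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))
```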
KerfuffleV2 commented on Jul 26, 2023
Have you tried just temporarily disabling permute() in the llama.cpp convert.py? Basically just changing it so it returns the weights unmodified.
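For instance, something like this (just a sketch of the experiment, not a proper fix):

```python
import numpy as np

# Sketch only: neuter permute() so the HF tensors are written out unchanged,
# purely to see whether the extra permutation is what breaks the 70B output.
def permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    return weights
```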
It may not work (or some kind of reshape might still be needed) but this should be a pretty easy one to at least try. I'd test it myself but I don't have the 16bit 70B at hand.
klosax commented on Jul 26, 2023
No, I don't think that would work, since all pth models, including the 70B, are transformed by the HF permute(). I guess we need a new permute() in convert.py to reverse it.
7erminalVelociraptor commented on Jul 26, 2023
I can try it out later today; I have the Airoboros 70B finetune in HF format on my desktop, and I think the base Llama 2 as well.
MrJackSpade commented on Jul 26, 2023
What's the level of effort on porting the permutation changes?
Does it require a lot of knowledge, or is it a straight transposition?
klosax commented on Jul 26, 2023
Here is the new HF conversion script that converts the original pth models to HF format:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
This is the old HF conversion script:
https://github.com/huggingface/transformers/blob/feb83521eca849731573dd40da89a02e4f370e5a/src/transformers/models/llama/convert_llama_weights_to_hf.py
Now, llama.cpp wants the tensors in the pth layout, so any transformations made to the HF tensors need to be reversed in convert.py.
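As far as I can tell from the linked script, that transformation only touches the per-layer Q and K projection weights, so those are the ones that need un-permuting; roughly:

```python
# Tensors that the HF conversion script permutes (per transformer layer), based on the
# linked script; everything else is copied over unchanged, so only these two need the
# reverse permute in convert.py:
PERMUTED_BY_HF_SCRIPT = [
    "model.layers.{i}.self_attn.q_proj.weight",  # permuted with n_heads
    "model.layers.{i}.self_attn.k_proj.weight",  # permuted with num_key_value_heads on GQA models
]
```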
mj-shifu commented on Jul 27, 2023
Hello,
I think I've managed to alter the conversion script so that the converted model does not produce gibberish any more:
https://github.com/mj-shifu/llama.cpp/blob/e15a67d6b21c10326a5cc74ab6d6ce9b8d7702bb/convert.py
I am not sure if it is correct but the converted Huggingface model produces exactly the same outputs as the converted pth model when sampling is disabled.
@TheBloke Could you possibly try that out?
I'm sorry for any mistakes, this is my first time contributing here.
TheBloke commented on Jul 27, 2023
Wonderful, thank you! I am having dinner now but will check as soon as I am at my PC.
KerfuffleV2 commented on Jul 27, 2023
Nice, I was working on this as well, and what you did is very close to my approach. So, theoretically, if I'm not an idiot, that is a good sign.
I'd suggest making the type for n_kv_head be Optional[int], since it gets set to None. Also, you then don't even need a conditional to populate it; you can just do n_kv_head = config.get("num_key_value_heads").
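Something along these lines (only a sketch of the suggestion; the helper name here is made up):

```python
from typing import Optional

def read_n_kv_head(config: dict) -> Optional[int]:
    # dict.get() already returns None when num_key_value_heads is absent,
    # so no separate conditional is needed for configs without the key.
    return config.get("num_key_value_heads")

# e.g. a Llama 2 70B config.json gives 8; a config lacking the key gives None.
```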
mj-shifu commented on Jul 27, 2023
Thanks for the suggestions. I just updated the code accordingly. Although I also think that the similarities are a good sign, I'm sorry that I caused duplicated work.
https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py
KerfuffleV2 commented on Jul 27, 2023
Absolutely no need to apologize. I'm happy to see someone else got it done! Assuming it works, you should make a pull with these changes! I can't test whether it actually works, but for whatever my opinion is worth the code looks very reasonable.
If you wanted to reduce duplication a bit you could try something like:
mj-shifu commented on Jul 27, 2023
Thank you! I should probably wait for @TheBloke's result before I make a pull request, shouldn't I? I have only tested it with a Huggingface model that I converted locally from the 70B pth model using the official Huggingface script.
klosax commented on Jul 27, 2023
If n_kv_head is used, n_head should not be divided by n_kv_head in the third parameter.
Something like this looks correct to me:
KerfuffleV2 commented on Jul 27, 2023
I don't think either approach is wrong, so if you're more comfortable with that then it's perfectly fine.
Pull requests have to get approved by someone before they're merged, so as long as you added a note that it was still being tested there wouldn't be a danger of it instantly getting merged with problems. It would also be possible to create a draft pull request that can't be merged until you set it ready for review.
It's generally easier to discuss actual code changes when they're in a pull request so that's why if it was me I'd probably just go ahead and create a pull (even a draft one).
I finally finished downloading an HF version ( https://huggingface.co/stabilityai/StableBeluga2 - previously known as FreeWilly2; they just randomly renamed it recently ) and am converting it, but it's going to take a while and then it needs to be quantized. I'll let you know the results once it's completed and I can try to run inference. TheBloke might beat me to it even if he starts much later though. :)
mj-shifu commented on Jul 27, 2023
@klosax Yes, but I think you can even remove the dim1, because // (n_head // n_kv_head) becomes // n_head * n_kv_head, which is what we want.
@KerfuffleV2 That's very cool! I'll make a pull request.
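(As a quick sanity check of that equality with the 70B shapes, hidden size 8192 with 64 heads and 8 KV heads:)

```python
# The two divisions agree whenever the row count is a multiple of n_head,
# which it is for these attention weights.
rows, n_head, n_kv_head = 8192, 64, 8  # Llama 2 70B attention shapes
assert rows // (n_head // n_kv_head) == rows // n_head * n_kv_head == 1024
```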
mkroman commented on Jul 27, 2023
I'd wager the renaming was due to trademark issues.
I've converted it with https://github.com/mj-shifu/llama.cpp/blob/01d16e1a1efced0cfbe92ed0c94c8003d22dbe54/convert.py and quantized it for Q4_K_S and have uploaded it.
It is available here (for an unknown amount of time, but at least for as long as this PR):
https://storage.labs.rwx.im/llm/stable-beluga-2/stable-beluga-2-ggml/stable-beluga-2-q4_k_s-ggml.bin
It'll be a bit before I can test it myself, but feel free to try the link if it's faster. It's ~36.2 GiB.
SHASUMs:
mj-shifu commented on Jul 27, 2023
@mkroman Your converted model works with GGML!
mkroman commented on Jul 27, 2023
Yeah, just confirmed it myself. It looks very promising - thanks for the patch :)
Output
TheBloke commented on Jul 27, 2023
Wonderful work, @mj-shifu! My Stable Beluga 2 GGMLs are uploading, and I will soon do Airoboros 70B and Guanaco 70B.
Thanks so much for getting this done. Amazing first contribution! :)