Conversation

BramVanroy

This snippet yields an error:

python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16
Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

The proposed changes add support for phi-2, which uses CodeGenTokenizer, a BPE tokenizer.
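For context on the error above, here is a minimal sketch (assumptions only, not the actual llama.cpp code) of how `get_vocab_base_pre()` recognizes a BPE pre-tokenizer: `convert-hf-to-gguf.py` encodes a fixed test string with the model's tokenizer, hashes the resulting token IDs, and looks the digest up in a table of known pre-tokenizers; an unknown digest raises the `NotImplementedError` shown in the traceback. The short test string and the `ByteTokenizer` stand-in below are hypothetical.

```python
import hashlib

CHKTXT = "Hello world \n\n \t 3.14"  # the real test string is much longer

def identify_pretokenizer(tokenizer, known):
    """Return the pre-tokenizer name for `tokenizer`, or raise as llama.cpp does."""
    ids = tokenizer.encode(CHKTXT)
    chkhsh = hashlib.sha256(str(ids).encode("utf-8")).hexdigest()
    if chkhsh in known:
        return known[chkhsh]
    raise NotImplementedError(
        "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

class ByteTokenizer:
    """Hypothetical stand-in for a real BPE tokenizer (one token per byte)."""
    def encode(self, text):
        return list(text.encode("utf-8"))

tok = ByteTokenizer()
digest = hashlib.sha256(str(tok.encode(CHKTXT)).encode("utf-8")).hexdigest()
print(identify_pretokenizer(tok, {digest: "phi-2"}))  # prints: phi-2
```

Registering a model therefore means adding its digest-to-name mapping, which is what this PR does for phi-2.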

closes #7022

@mofosyne added the model (Model specific) and Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level) labels May 15, 2024
@linpan

linpan commented May 17, 2024

raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

            res = "jina-v2-de"
        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-2
            res = "phi-2"

Contributor

teleprint-me commented May 17, 2024


@ggerganov That's not necessary. I already solved this in #7219 and #7117.

@turian

turian commented May 22, 2024

Hi @BramVanroy I was encouraging you in #7022 to test that HF and llama tokenization are identical. Here is a colab you could modify to try: https://colab.research.google.com/drive/1RYlEj2UhylYWyaASFo-LLATzZ8d29Z0T?usp=sharing
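The parity check suggested above can be sketched as a small harness: encode a set of probe strings with both the HF tokenizer and the llama.cpp tokenizer and report any ID sequences that differ. In real use the two encoders would come from `transformers.AutoTokenizer` and a llama.cpp binding; the byte-level lambdas below are placeholders so the harness runs standalone.

```python
# Probe strings chosen to stress whitespace, accents, and emoji handling.
PROBES = ["Hello world", "  leading spaces", "émoji 🙂", "\n\ttabs\n"]

def check_parity(encode_hf, encode_llama, probes=PROBES):
    """Return a list of (text, hf_ids, llama_ids) triples that disagree."""
    mismatches = []
    for text in probes:
        a, b = encode_hf(text), encode_llama(text)
        if a != b:
            mismatches.append((text, a, b))
    return mismatches

# Placeholder byte-level encoders standing in for the real tokenizers:
enc = lambda s: list(s.encode("utf-8"))
print(check_parity(enc, enc))  # identical tokenizers -> []
```

An empty result means the two tokenizers agree on every probe; any mismatch pinpoints the input that triggers a pre-tokenizer difference.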

@BramVanroy
Author

I'm unsure what has changed but it seems that phi-2 models are working again so that's good news. Will close this one for now.

@BramVanroy BramVanroy closed this Aug 27, 2024
@RhinoDevel
Contributor

> I'm unsure what has changed but it seems that phi-2 models are working again so that's good news. Will close this one for now.

Well, convert-hf-to-gguf-update.py still doesn't have a "phi-2" entry. The models should work with the default tokenizer, though (but that has been the case for a long time).
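For illustration, a "phi-2" entry in the models list of convert-hf-to-gguf-update.py might look roughly like the sketch below. The field names ("name", "tokt", "repo") and the `TOKENIZER_TYPE_BPE` stand-in follow the script's general pattern but should be checked against the current source; treat this as a hypothetical sketch, not a drop-in patch.

```python
TOKENIZER_TYPE_BPE = "BPE"  # stand-in for the script's tokenizer-type enum

models = [
    # ... existing entries ...
    {"name": "phi-2", "tokt": TOKENIZER_TYPE_BPE,
     "repo": "https://huggingface.co/microsoft/phi-2"},
]

entry = next(m for m in models if m["name"] == "phi-2")
print(entry["repo"])  # prints: https://huggingface.co/microsoft/phi-2
```

The update script would then download that repo's tokenizer, recompute the checksum, and regenerate the matching branch in `get_vocab_base_pre()`.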

Successfully merging this pull request may close these issues.

Supporting phi-2 tokenizer