Conversation

BramVanroy

This snippet yields an error:

python -c "
from huggingface_hub import snapshot_download;
snapshot_download(repo_id='microsoft/phi-2', local_dir='phi-2', local_dir_use_symlinks=False)
"
python convert-hf-to-gguf.py phi-2/ --outtype f16
Traceback (most recent call last):
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 3001, in <module>
    main()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 2988, in main
    model_instance.set_vocab()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 75, in set_vocab
    self._set_vocab_gpt2()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 331, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 242, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/local/vanroy/llama.cpp/convert-hf-to-gguf.py", line 323, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

The proposed changes add support for phi-2, which uses CodeGenTokenizer, a BPE tokenizer.
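For context on the error above, here is a minimal sketch (assumptions only, not the actual llama.cpp code) of how `get_vocab_base_pre()` recognizes a BPE pre-tokenizer: `convert-hf-to-gguf.py` encodes a fixed test string with the model's tokenizer, hashes the resulting token IDs, and looks the digest up in a table of known pre-tokenizers; an unknown digest raises the `NotImplementedError` shown in the traceback. The short test string and the `ByteTokenizer` stand-in below are hypothetical.

```python
import hashlib

CHKTXT = "Hello world \n\n \t 3.14"  # the real test string is much longer

def identify_pretokenizer(tokenizer, known):
    """Return the pre-tokenizer name for `tokenizer`, or raise as llama.cpp does."""
    ids = tokenizer.encode(CHKTXT)
    chkhsh = hashlib.sha256(str(ids).encode("utf-8")).hexdigest()
    if chkhsh in known:
        return known[chkhsh]
    raise NotImplementedError(
        "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

class ByteTokenizer:
    """Hypothetical stand-in for a real BPE tokenizer (one token per byte)."""
    def encode(self, text):
        return list(text.encode("utf-8"))

tok = ByteTokenizer()
digest = hashlib.sha256(str(tok.encode(CHKTXT)).encode("utf-8")).hexdigest()
print(identify_pretokenizer(tok, {digest: "phi-2"}))  # prints: phi-2
```

Registering a model therefore means adding its digest-to-name mapping, which is what this PR does for phi-2.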

closes #7022

@mofosyne added the model (Model specific) and Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level) labels May 15, 2024
@linpan

linpan commented May 17, 2024

raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")

NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

            res = "jina-v2-de"
        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
            # ref: https://huggingface.co/microsoft/phi-2
            res = "phi-2"

Contributor

teleprint-me commented May 17, 2024


@ggerganov That's not necessary. I already solved this in #7219 and #7117.

@turian

turian commented May 22, 2024

Hi @BramVanroy I was encouraging you in #7022 to test that HF and llama tokenization are identical. Here is a colab you could modify to try: https://colab.research.google.com/drive/1RYlEj2UhylYWyaASFo-LLATzZ8d29Z0T?usp=sharing
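The parity check suggested above can be sketched as a small harness: encode a set of probe strings with both the HF tokenizer and the llama.cpp tokenizer and report any ID sequences that differ. In real use the two encoders would come from `transformers.AutoTokenizer` and a llama.cpp binding; the byte-level lambdas below are placeholders so the harness runs standalone.

```python
# Probe strings chosen to stress whitespace, accents, and emoji handling.
PROBES = ["Hello world", "  leading spaces", "émoji 🙂", "\n\ttabs\n"]

def check_parity(encode_hf, encode_llama, probes=PROBES):
    """Return a list of (text, hf_ids, llama_ids) triples that disagree."""
    mismatches = []
    for text in probes:
        a, b = encode_hf(text), encode_llama(text)
        if a != b:
            mismatches.append((text, a, b))
    return mismatches

# Placeholder byte-level encoders standing in for the real tokenizers:
enc = lambda s: list(s.encode("utf-8"))
print(check_parity(enc, enc))  # identical tokenizers -> []
```

An empty result means the two tokenizers agree on every probe; any mismatch pinpoints the input that triggers a pre-tokenizer difference.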

@BramVanroy
Author

I'm unsure what has changed but it seems that phi-2 models are working again so that's good news. Will close this one for now.

@BramVanroy BramVanroy closed this Aug 27, 2024
@RhinoDevel
Contributor

> I'm unsure what has changed but it seems that phi-2 models are working again so that's good news. Will close this one for now.

Well, convert-hf-to-gguf-update.py still doesn't have a "phi-2" entry. The models should work with the default tokenizer, though (but that has been the case for a long time).
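For illustration, a "phi-2" entry in the models list of convert-hf-to-gguf-update.py might look roughly like the sketch below. The field names ("name", "tokt", "repo") and the `TOKENIZER_TYPE_BPE` stand-in follow the script's general pattern but should be checked against the current source; treat this as a hypothetical sketch, not a drop-in patch.

```python
TOKENIZER_TYPE_BPE = "BPE"  # stand-in for the script's tokenizer-type enum

models = [
    # ... existing entries ...
    {"name": "phi-2", "tokt": TOKENIZER_TYPE_BPE,
     "repo": "https://huggingface.co/microsoft/phi-2"},
]

entry = next(m for m in models if m["name"] == "phi-2")
print(entry["repo"])  # prints: https://huggingface.co/microsoft/phi-2
```

The update script would then download that repo's tokenizer, recompute the checksum, and regenerate the matching branch in `get_vocab_base_pre()`.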

Successfully merging this pull request may close these issues.

Supporting phi-2 tokenizer