This issue follows on from the discussions we had at the end of @strutive07's PR which added support for `tokenizer.json`, here: #3633
Summary
Llama and Mistral models converted to GGUF from `tokenizer.json` experience an issue with newlines, printing `<0x0A>` instead of `\n`. The issue does not exist when `tokenizer.model` is used for the same model.
This represents an issue for some new fine-tunes which do not include `tokenizer.model`. Sometimes this is simply an oversight, and the base model's `tokenizer.model` can be used. But in some cases the models have extended or changed the vocab in `tokenizer.json`, and a new SPM model would need to be created. (Something that I've not yet been able to figure out how to do.)
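For context, `<0x0A>` is the SentencePiece byte-fallback piece for the raw newline byte `0x0A`, and in the Llama/Mistral vocab it should be token id 13. The piece is present in `tokenizer.json` as well, where the ByteFallback decoder maps it back to `\n`. A minimal sketch (assuming the Hugging Face `tokenizers` package and the `test-mistral` path used in the steps below):

```python
from tokenizers import Tokenizer

# Load the fast tokenizer definition directly from tokenizer.json
tok = Tokenizer.from_file("/workspace/test-mistral/tokenizer.json")

print(tok.id_to_token(13))     # expected: '<0x0A>' - the byte-fallback piece
print(repr(tok.decode([13])))  # expected: '\n'     - the decoder maps the byte piece back
```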
Steps to reproduce
- Download any Llama or Mistral 7B repo which contains both `tokenizer.model` and `tokenizer.json`:
pip3 install --upgrade 'huggingface-hub>=0.18' # if not installed
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir test-mistral --local-dir-use-symlinks False
- Run `convert.py` on it, and verify that the output is as expected. Because `tokenizer.model` is present, it will be used in preference to `tokenizer.json`, and no issue will exist.
$ ls -al /workspace/test-mistral/tokenizer.model
-rw-rw-r-- 1 quant quant 482K Dec 24 17:14 /workspace/test-mistral/tokenizer.model
$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/with-tokenizer.model.fp16.gguf
$ ./main -m /workspace/test-mistral/with-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
A haiku example is 5-7-5 syllables.
The first line has five syllables, the second line has seven syllables and the
- Remove `tokenizer.model` to force `tokenizer.json` to be used, and re-run `convert.py`:
$ mv /workspace/test-mistral/tokenizer.model /workspace/test-mistral/dead.tokenizer.model
$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/no-tokenizer.model.fp16.gguf
- Test inference and note that `\n` is now represented as `<0x0A>` in the output (see the SPM byte-piece check sketched after these steps):
$ ./main -m /workspace/test-mistral/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
A haiku example is 5-7-5 syllables.<0x0A><0x0A>The first line has five syllables, the second line has seven syllables and the
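As noted in the last step, the byte piece can also be checked against the original SPM model: `tokenizer.model` records `<0x0A>` as a byte-type piece, which is presumably why the `tokenizer.model` conversion path round-trips newlines correctly. A rough check with the `sentencepiece` package (using the file renamed in the previous step):

```python
import sentencepiece as spm

# Load the SPM model that was renamed out of the way earlier
sp = spm.SentencePieceProcessor(model_file="/workspace/test-mistral/dead.tokenizer.model")

print(sp.id_to_piece(13))     # expected: '<0x0A>'
print(sp.is_byte(13))         # expected: True  - marked as a byte-fallback piece
print(repr(sp.decode([13])))  # expected: '\n'
```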
Testing the same using Hugging Face transformers does not show an issue:
In [1]: import os
...: from transformers import AutoTokenizer
...: print ("tokenizer.model exists:", os.path.exists("/workspace/test-mistral/tokenizer.model"))
...: tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
...: encoded = tokenizer(""" A haiku example is 5-7-5 syllables.
...:
...: The first line has five syllables, the second line has seven syllables and the""")
...: print(f"Tokens: {encoded.input_ids}")
...: print(f"Decoded again: '{tokenizer.decode(encoded.input_ids)}'")
tokenizer.model exists: False
Tokens: [1, 28705, 330, 3631, 23550, 2757, 349, 28705, 28782, 28733, 28787, 28733, 28782, 5747, 584, 2561, 28723, 13, 13, 1014, 907, 1407, 659, 3359, 5747, 584, 2561, 28725, 272, 1676, 1407, 659, 6671, 5747, 584, 2561, 304, 272]
Decoded again: '<s> A haiku example is 5-7-5 syllables.
The first line has five syllables, the second line has seven syllables and the'
In [2]: tokenizer.decode(13)
Out[2]: '\n'
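The raw vocab piece behind id 13 can also be inspected from the same session; it is the byte token itself, and only the decode step maps it back to a newline. A small follow-up (reusing the `tokenizer` object from above; the outputs in the comments are what I'd expect):

```python
# The vocab piece for id 13 is the byte token; decode() is what maps it to a newline
tokenizer.convert_ids_to_tokens(13)  # expected: '<0x0A>'
tokenizer.decode([13])               # expected: '\n'
```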
Llama example
$ huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir test-llama2 --local-dir-use-symlinks False
$ mv test-llama2/tokenizer.model test-llama2/dead.tokenizer.model
$ python3 ./convert.py /workspace/test-llama2 --outtype f16 --outfile /workspace/test-llama2/no-tokenizer.model.fp16.gguf
$ ./main -m /workspace/test-llama2/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0
A haiku example is 5 syllables, 7 syllables, and 5 syllables.<0x0A>A haiku is a traditional form of Japanese poetry that