
convert.py: Mistral models converted from tokenizer.json display <0x0A> instead of newlines. #4622

Closed
@TheBloke

Description


This issue follows on from the discussions we had at the end of @strutive07 's PR which added support for tokenizer.json, here: #3633

Summary

Llama and Mistral models converted to GGUF from tokenizer.json have an issue with newlines, printing <0x0A> instead of \n. The issue does not exist when tokenizer.model is used for the same model.

This is a problem for some new fine-tunes which do not include tokenizer.model. Sometimes the omission is simply a mistake, and the base model's tokenizer.model can be used. But in some cases the model has extended or changed the vocab in tokenizer.json, and a new SPM model would need to be created. (Something that I've not yet been able to figure out how to do.)
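For context, SentencePiece byte-fallback entries appear in the vocab as literal strings of the form <0xNN>, so a converter reading tokenizer.json has to map those pieces back to raw bytes rather than carry them through as plain text. A minimal sketch of that mapping (a hypothetical helper for illustration, not the actual convert.py code):

```python
import re

# SentencePiece byte-fallback tokens look like "<0xNN>". When building a
# vocab from tokenizer.json these must be recognised and mapped to the raw
# byte; otherwise the literal string "<0x0A>" leaks into generated text.
BYTE_TOKEN_RE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

def decode_piece(piece: str) -> bytes:
    """Map a vocab piece to the bytes it represents (minimal sketch)."""
    m = BYTE_TOKEN_RE.match(piece)
    if m:
        return bytes([int(m.group(1), 16)])
    # SentencePiece marks a leading space with U+2581 (lower one-eighth block)
    return piece.replace("\u2581", " ").encode("utf-8")

print(decode_piece("<0x0A>"))       # b'\n'
print(decode_piece("\u2581hello"))  # b' hello'
```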

Steps to reproduce

  1. Download any Llama or Mistral 7B repo which contains tokenizer.model and tokenizer.json:
pip3 install --upgrade 'huggingface-hub>=0.18'     # if not installed
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir test-mistral  --local-dir-use-symlinks False
  2. Run convert.py on it, and verify that output is as expected. Because tokenizer.model is present, it will be used in preference to tokenizer.json, and no issue will exist.
$ ls -al /workspace/test-mistral/tokenizer.model
-rw-rw-r-- 1 quant quant 482K Dec 24 17:14 /workspace/test-mistral/tokenizer.model

$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/with-tokenizer.model.fp16.gguf

$ ./main -m /workspace/test-mistral/with-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the
  3. Remove tokenizer.model to force tokenizer.json to be used, and re-run convert.py:
$ mv /workspace/test-mistral/tokenizer.model /workspace/test-mistral/dead.tokenizer.model

$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/no-tokenizer.model.fp16.gguf
  4. Test inference and note that \n is now represented as <0x0A> in the output:
$ ./main -m /workspace/test-mistral/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

 A haiku example is 5-7-5 syllables.<0x0A><0x0A>The first line has five syllables, the second line has seven syllables and the

Testing the same model with Hugging Face transformers does not show the issue:

In [1]: import os
   ...: from transformers import AutoTokenizer
   ...: print ("tokenizer.model exists:", os.path.exists("/workspace/test-mistral/tokenizer.model"))
   ...: tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
   ...: encoded = tokenizer(""" A haiku example is 5-7-5 syllables.
   ...:
   ...: The first line has five syllables, the second line has seven syllables and the""")
   ...: print(f"Tokens: {encoded.input_ids}")
   ...: print(f"Decoded again: '{tokenizer.decode(encoded.input_ids)}'")
tokenizer.model exists: False
Tokens: [1, 28705, 330, 3631, 23550, 2757, 349, 28705, 28782, 28733, 28787, 28733, 28782, 5747, 584, 2561, 28723, 13, 13, 1014, 907, 1407, 659, 3359, 5747, 584, 2561, 28725, 272, 1676, 1407, 659, 6671, 5747, 584, 2561, 304, 272]
Decoded again: '<s>  A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the'

In [2]: tokenizer.decode(13)
Out[2]: '\n'
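So transformers renders token 13 correctly because it treats that vocab entry, the literal piece "<0x0A>", as a byte-fallback token rather than ordinary text. A toy illustration of the distinction, using an inline excerpt in the shape of tokenizer.json's vocab (the ids and pieces here are assumed for illustration):

```python
import json

# Hypothetical excerpt mimicking the "model.vocab" section of a Llama/Mistral
# tokenizer.json, which maps each piece string to its token id.
vocab_excerpt = json.loads('{"<0x0A>": 13, "\\u2581the": 272}')

for piece, token_id in vocab_excerpt.items():
    # A byte-fallback piece must be decoded to the byte it names, not
    # rendered verbatim -- rendering it verbatim is exactly what produces
    # "<0x0A>" in the GGUF output above.
    if piece.startswith("<0x") and piece.endswith(">"):
        rendered = bytes([int(piece[3:-1], 16)])
    else:
        rendered = piece.replace("\u2581", " ").encode("utf-8")
    print(token_id, repr(rendered))
```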

Llama example

$ huggingface-cli download meta-llama/Llama-2-7b-chat-hf  --local-dir test-llama2  --local-dir-use-symlinks False

$ mv test-llama2/tokenizer.model test-llama2/dead.tokenizer.model

$ python3 ./convert.py /workspace/test-llama2 --outtype f16 --outfile /workspace/test-llama2/no-tokenizer.model.fp16.gguf

$ ./main -m /workspace/test-llama2/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

A haiku example is 5 syllables, 7 syllables, and 5 syllables.<0x0A>A haiku is a traditional form of Japanese poetry that

CC @ArthurZucker
