Closed
Labels: bug (Something isn't working), high priority (Very important issue)
Description
I am comparing the tokenization of the codellama repository with the infill example of this repository.
The first example prompt from the codellama repository consists of the strings:
- Prefix: 'def remove_non_ascii(s: str) -> str:\n """ '
- Suffix: '\n return result\n'
Comparing the tokenization of both implementations results in:
- CodeLlama: 1 32007 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 13 1678 736 1121 13 32009
- Llama.cpp: 32007 1 822 3349 29918 5464 29918 294 18869 29898 29879 29901 851 29897 1599 851 29901 13 1678 9995 29871 32008 1 29871 13 1678 736 1121 13 32009
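For reference, the CodeLlama sequence above decomposes cleanly into special tokens plus the raw prefix/suffix pieces. The sketch below is an assumption about the intended layout (token IDs copied from the comparison above; the names `BOS`, `PREFIX_ID`, `SUFFIX_ID`, `MIDDLE_ID` follow the CodeLlama repository's terminology):

```python
# CodeLlama special tokens (IDs taken from the sequences above)
BOS, PREFIX_ID, SUFFIX_ID, MIDDLE_ID = 1, 32007, 32008, 32009

def build_infill_prompt(prefix_tokens, suffix_tokens):
    # Reference (CodeLlama) order: BOS first, then the prefix sentinel.
    # No second BOS after the suffix sentinel, and the suffix is encoded
    # without an extra leading-space token.
    return [BOS, PREFIX_ID] + prefix_tokens + [SUFFIX_ID] + suffix_tokens + [MIDDLE_ID]
```

Plugging in the prefix/suffix pieces from the first example reproduces the CodeLlama sequence exactly, which suggests the llama.cpp output diverges only in the two places listed below.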
There are two differences:
- The first two tokens are swapped (those are `prefix_id` and `bos`, I think).
- Llama.cpp adds a `bos` token again after the `suffix_id` token, plus an additional 29871 (is this a space?).
I believe the latter is definitely wrong, as the paper states on page 4:

> To limit the distribution shift between autoregressive and infilling training, we suppress the implicit leading space that SentencePiece tokenizers add upon encoding the middle part and the suffix
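The suppression the paper describes can be sketched as a post-processing step: encode the part with SentencePiece as usual, then drop the leading space token if the tokenizer inserted one. This is a sketch, not llama.cpp's actual code; it assumes 29871 is the lone `▁` (space) piece in the CodeLlama vocabulary, and `sp_encode` stands in for any SentencePiece encode function:

```python
SPM_SPACE_TOKEN = 29871  # assumed: the lone "▁" piece in the CodeLlama vocab

def encode_infill_part(sp_encode, text):
    # SentencePiece implicitly prepends "▁" when encoding; per the paper (p. 4),
    # this leading space token should be stripped from the suffix (and middle)
    # parts of an infilling prompt.
    tokens = sp_encode(text)
    if tokens and tokens[0] == SPM_SPACE_TOKEN:
        tokens = tokens[1:]
    return tokens
```

If llama.cpp skipped this step, that would explain the extra 29871 it emits after `suffix_id`.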