Question
This is probably a known issue, as I'm aware this project lags a bit behind the fast-moving changes in the Python transformers library, but I wanted to document a specific compatibility issue I hit:
Tokenizers 0.19 introduced some breaking changes that result in different outputs for (at least) Metaspace tokenizers, which leads to invalid results when converting a model using the scripts.convert script with a newer transformers version. I hit this while trying to update the dependencies used by the script to unify them with the other deps in my environment, and found that the script started producing different JSON for tokenizers. In tokenizer.json, the pre_tokenizer and decoder entries now appear with a `split` field instead of `add_prefix_space`:
```diff
< "prepend_scheme": "always",
< "split": true
---
> "add_prefix_space": true,
> "prepend_scheme": "always"
```
Breaking changes: