compat with transformers >= 4.40 and tokenizers >= 0.19 #866

@joprice

Description

Question

This is probably a known issue, since I'm aware that this project lags a bit behind the fast-moving Python transformers library, but I wanted to document a specific compatibility problem I hit:

Tokenizers 0.19 introduced breaking changes that alter the output for (at least) Metaspace tokenizers, which leads to invalid results when converting a model with the scripts.convert script under newer transformers versions. I hit this while updating the script's dependencies to unify them with the other dependencies in my environment, and found that the script started producing different JSON for tokenizers. In tokenizer.json, the pre_tokenizer and decoder entries now carry a split field instead of add_prefix_space:

<         "prepend_scheme": "always",
<         "split": true
---
>         "add_prefix_space": true,
>         "prepend_scheme": "always"

Breaking changes:

Labels: question (Further information is requested)
