
Adding new non-latin tokens to the T5 tokenizer creates unnecessary whitespaces #26101

@sl5035

Description

System Info

  • transformers version: 4.27.4
  • Platform: Linux-5.4.0-99-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.15.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@Arthur

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi, I am trying to add tokens to the Pix2StructProcessor, which uses the T5 tokenizer. I added non-Latin (Korean) tokens with the script below, so len(processor.tokenizer.get_vocab()) grows from the original 50344 to 65536.

from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("deplot_models/deplot_base_model/", is_vqa=True)
model = Pix2StructForConditionalGeneration.from_pretrained(
    "deplot_models/deplot_base_model/", is_vqa=True
)

# full_vocab.txt holds the target vocabulary, one token per line;
# everything past the original pieces is a new (Korean) token.
with open("data/full_vocab.txt", "r") as f:
    full_v = [v.strip("\n") for v in f.readlines()]
new_t = full_v[50345:]

processor.tokenizer.add_tokens(new_t)

print("Processor loaded!")

The problem arises when I tokenize Korean sentences with the extended tokenizer:
processor.tokenizer.tokenize('토크나이저 테스트 중입니다') outputs ['▁', '토', '크', '나이', '▁저', '▁', '테', '스트', '▁중', '입니다'].

The fifth token should be '저' instead of '▁저' (with the underscore): I added both forms to the vocab, but the tokenizer picks the prefixed one, effectively inserting a space that is not in the input.
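
Both surface forms really are in the extended vocabulary; a quick sanity check, reusing the processor from the script above:

vocab = processor.tokenizer.get_vocab()
for t in ("저", "▁저"):
    # both should print True after add_tokens
    print(repr(t), t in vocab)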

TL;DR

  1. Extend the T5 tokenizer with non-Latin (Korean) tokens.
  2. Tokenize a sentence that contains them.
  3. Spurious '▁' (whitespace) tokens appear around the added tokens; see the standalone sketch below.
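
A self-contained version of these steps on a public checkpoint ("t5-small" is only a stand-in for the local DePlot model above; any T5-based tokenizer should reproduce it):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.add_tokens(["저"])  # one of the Korean pieces added above

# On the affected version, pieces adjacent to the added token pick up
# spurious "▁" (whitespace) prefixes.
print(tokenizer.tokenize("토크나이저 테스트 중입니다"))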

Expected behavior

The original T5 tokenizer (without the added tokens) tokenizes the sentence correctly:
['▁', '토', '크', '나', '이', '저', '▁', '테', '스트', '▁중', '입니다']
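
For reference, later transformers releases (newer than the 4.27.4 reported above) added a legacy flag to the T5 tokenizer that changes how sentencepiece treats text around added tokens. A sketch of that mitigation, not verified against this exact Pix2Struct setup:

from transformers import T5Tokenizer

# `legacy=False` is only available in newer releases; "t5-small" again
# stands in for the local checkpoint.
tokenizer = T5Tokenizer.from_pretrained("t5-small", legacy=False)
tokenizer.add_tokens(["저"])
print(tokenizer.tokenize("토크나이저 테스트 중입니다"))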
