
Adding new non-latin tokens to the T5 tokenizer creates unnecessary whitespaces #26101

@sl5035

Description

System Info

  • transformers version: 4.27.4
  • Platform: Linux-5.4.0-99-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.15.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@Arthur

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi, I am trying to add tokens to the Pix2StructProcessor, which uses the T5 tokenizer. I added non-Latin (Korean) tokens with the script below, so len(processor.tokenizer.get_vocab()) grows from the original 50344 to 65536.

from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("deplot_models/deplot_base_model/", is_vqa=True)
model = Pix2StructForConditionalGeneration.from_pretrained(
    "deplot_models/deplot_base_model/", is_vqa=True
)

# full_vocab.txt holds the target vocabulary, one token per line;
# everything past the original pieces is a new (Korean) token.
with open("data/full_vocab.txt", "r") as f:
    full_v = [v.strip("\n") for v in f.readlines()]
new_t = full_v[50345:]

processor.tokenizer.add_tokens(new_t)

print("Processor loaded!")

The problem arises when I tokenize Korean sentences with the extended tokenizer:
processor.tokenizer.tokenize('토크나이저 테스트 중입니다') outputs ['▁', '토', '크', '나이', '▁저', '▁', '테', '스트', '▁중', '입니다'].

The fifth token should be '저' instead of '▁저' (with the underscore): I added both forms to the vocab, but the tokenizer picks the prefixed one, effectively inserting a space that is not in the input.
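
Both surface forms really are in the extended vocabulary; a quick sanity check, reusing the processor from the script above:

vocab = processor.tokenizer.get_vocab()
for t in ("저", "▁저"):
    # both should print True after add_tokens
    print(repr(t), t in vocab)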

TL;DR

  1. Extend the T5 tokenizer with non-Latin (Korean) tokens.
  2. Tokenize a sentence that contains them.
  3. Spurious '▁' (whitespace) tokens appear around the added tokens; see the standalone sketch below.
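
A self-contained version of these steps on a public checkpoint ("t5-small" is only a stand-in for the local DePlot model above; any T5-based tokenizer should reproduce it):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.add_tokens(["저"])  # one of the Korean pieces added above

# On the affected version, pieces adjacent to the added token pick up
# spurious "▁" (whitespace) prefixes.
print(tokenizer.tokenize("토크나이저 테스트 중입니다"))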

Expected behavior

The original T5 tokenizer (without the added tokens) tokenizes the sentence correctly:
['▁', '토', '크', '나', '이', '저', '▁', '테', '스트', '▁중', '입니다']
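
For reference, later transformers releases (newer than the 4.27.4 reported above) added a legacy flag to the T5 tokenizer that changes how sentencepiece treats text around added tokens. A sketch of that mitigation, not verified against this exact Pix2Struct setup:

from transformers import T5Tokenizer

# `legacy=False` is only available in newer releases; "t5-small" again
# stands in for the local checkpoint.
tokenizer = T5Tokenizer.from_pretrained("t5-small", legacy=False)
tokenizer.add_tokens(["저"])
print(tokenizer.tokenize("토크나이저 테스트 중입니다"))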
