[llama] Add resegment post processing of tokenizer #2072

howard0su · 2023-07-02T14:06:41Z

Try to address #2023

howard0su · 2023-07-02T14:07:52Z

@slaren can you give me some test cases?

Also, I reran the test to see how many tokens do not tokenize to themselves, and I got a different result than @vjeux.
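For readers following along, a minimal sketch of what such a "tokenizes to itself" check could look like. The helper names, the toy vocabulary, and the greedy longest-match tokenizer are all hypothetical stand-ins for illustration; this is not the llama.cpp tokenizer or the script used in the issue.

```python
from typing import Callable, Dict, List


def find_non_roundtrip_tokens(
    vocab: Dict[int, str], tokenize: Callable[[str], List[int]]
) -> List[int]:
    """Return the ids whose text does not tokenize back to exactly [id]."""
    return [tok_id for tok_id, text in vocab.items() if tokenize(text) != [tok_id]]


def make_greedy_tokenizer(vocab: Dict[int, str]) -> Callable[[str], List[int]]:
    """Build a toy greedy longest-match tokenizer (an assumption, not SPM's algorithm)."""
    text_to_id: Dict[str, int] = {}
    for tok_id, text in vocab.items():
        text_to_id[text] = tok_id  # duplicate texts: the later id wins
    max_len = max(len(text) for text in text_to_id)

    def tokenize(s: str) -> List[int]:
        ids: List[int] = []
        i = 0
        while i < len(s):
            # Try the longest candidate piece first, then shrink.
            for n in range(min(max_len, len(s) - i), 0, -1):
                tok = text_to_id.get(s[i:i + n])
                if tok is not None:
                    ids.append(tok)
                    i += n
                    break
            else:
                raise ValueError(f"no token matches at position {i}")
        return ids

    return tokenize


# Hypothetical vocabulary: ids 2 and 3 share the same text, so only one can round-trip.
toy_vocab = {0: "a", 1: "b", 2: "ab", 3: "ab"}
bad_ids = find_non_roundtrip_tokens(toy_vocab, make_greedy_tokenizer(toy_vocab))
print(bad_ids)
```

Differences in how duplicate or ambiguous vocabulary entries are resolved are one plausible reason two people running a check like this could get different counts.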

slaren · 2023-07-02T14:45:29Z

I would suggest comparing the output with what sentencepiece produces. I don't know anything beyond what @vjeux commented.

howard0su · 2023-07-03T12:33:30Z

I read through the code. One thing that concerns me is this line in convert.py:
text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")

When we replace "\u2581" with a space, we lose some information.
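A small sketch of why the replacement is lossy, reusing the expression from the convert.py line quoted above. The two pieces are hypothetical examples chosen for illustration, not entries read from a real vocabulary.

```python
SPM_SPACE = "\u2581"  # sentencepiece's whitespace marker


def convert_piece(piece: str) -> bytes:
    """Apply the same replacement as the quoted convert.py line."""
    return piece.replace(SPM_SPACE, " ").encode("utf-8")


# Hypothetical pieces: one uses the marker, one uses a literal space.
pieces = ["\u2581hello", " hello"]
converted = [convert_piece(p) for p in pieces]

# Two distinct pieces collapse to the same bytes, so the mapping cannot be inverted.
print(pieces[0] != pieces[1], converted[0] == converted[1])
```

Once both forms map to the same bytes, a tokenizer that only sees the converted text has no way to tell which original piece was meant.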

Take the original example from the related issue.
In SPM:

>>> encoder.encode("▁–")
[29871, 785]

In llama.cpp, token 785 is " ▁–" (there is a leading space), so llama cannot recognize that this string is token 785. It will split it into three chars, the way many other Unicode chars are handled.

Need some insight from @ggerganov: why did we decide to do the replacement?

mqy · 2023-07-03T13:08:29Z

Need some insight from @ggerganov: why did we decide to do the replacement?

@howard0su FYI: see "Fix parsing single-byte UTF-8 tokens by manually parsing the protobuf".
Search for u2581 there.

TL;DR

howard0su · 2023-07-04T13:10:24Z

Sorry, I cannot find a test case that hits the new resegment code I added.

ggerganov · 2023-07-06T19:20:25Z

I have only a brief understanding of how the tokenizer works, so I won't be able to help much.
If you observe that your changes match the results of the original implementation, then we can merge.

Maybe we can add even more test cases to test-tokenizer-0. Why not generate a bunch of random strings / bytes and verify that we produce the same tokens as the reference implementation?
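A rough sketch of how such a randomized comparison could be structured. Everything here is an assumption for illustration: the helper names, the small alphabet, and the two toy tokenizers. In a real test the reference callable would wrap sentencepiece and the candidate would wrap the llama.cpp tokenizer.

```python
import random
from typing import Callable, List, Tuple

Tokenizer = Callable[[str], List[int]]

# Small alphabet mixing ASCII, whitespace, the SPM marker, an en dash, and a CJK char.
ALPHABET = "ab \u2581\u2013\n\u4f60"


def random_text(rng: random.Random, max_len: int = 16) -> str:
    """Draw a random string of up to max_len characters from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))


def find_mismatches(
    reference: Tokenizer, candidate: Tokenizer, n_cases: int = 1000, seed: int = 0
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, expected, actual) for every input where the tokenizers disagree."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(n_cases):
        text = random_text(rng)
        expected, actual = reference(text), candidate(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


# Toy stand-ins: the "buggy" one loses the SPM marker before tokenizing.
def toy_reference(text: str) -> List[int]:
    return [ord(c) for c in text]


def toy_buggy(text: str) -> List[int]:
    return [ord(c) for c in text.replace("\u2581", " ")]


mismatches = find_mismatches(toy_reference, toy_buggy)
print(len(mismatches) > 0)
```

Recording the failing inputs, rather than only counting them, makes it easy to promote any discovered mismatch into a fixed test case.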

howard0su requested a review from slaren July 2, 2023 14:06

howard0su added 5 commits July 4, 2023 21:08

[llama] Add resegment post processing of tokenizer

e818537

Add tests

6caa066

special test

751e51c

More tests

ca150b7

Add test

5f04a5d

howard0su force-pushed the compute_graph branch from 4cfd4bb to 5f04a5d Compare July 4, 2023 13:10
