Skip to content

Speed up utf8 decoding #7068

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Speed up utf8 decoding #7068

wants to merge 2 commits into from

Conversation

data-man
Copy link
Contributor

@data-man data-man commented Nov 11, 2020

Extracted from #6390 and added isValidCodepoint and utf8CountCodepoints.
Added unicode/benchmark.zig.
Results for me (x86_64):

Name clz ReleaseFast clz ReleaseSmall switch ReleaseFast switch ReleaseSmall
short ASCII strings 332 MiB/s [3] 274 MiB/s [3] 384 MiB/s [3] 271 MiB/s [3]
short Unicode strings 414 MiB/s [3] 523 MiB/s [3] 601 MiB/s [3] 488 MiB/s [3]
pure ASCII strings 6729 MiB/s [80] 8990 MiB/s [80] 8501 MiB/s [80] 7485 MiB/s [80]
pure Unicode strings 608 MiB/s [80] 648 MiB/s [80] 691 MiB/s [80] 612 MiB/s [80]
mixed ASCII/Unicode strings 910 MiB/s [224] 1026 MiB/s [224] 973 MiB/s [224] 1094 MiB/s [224]
'boxes.txt' 364 MiB/s [6976] 383 MiB/s [6976] 509 MiB/s [6976] 506 MiB/s [6976]
'glasses.txt' 587 MiB/s [10014] 645 MiB/s [10014] 824 MiB/s [10014] 789 MiB/s [10014]
'quickbrown.txt' 730 MiB/s [3285] 843 MiB/s [3285] 1145 MiB/s [3285] 1086 MiB/s [3285]
'UTF-8-demo.txt' 363 MiB/s [7622] 385 MiB/s [7622] 470 MiB/s [7622] 470 MiB/s [7622]

@data-man
Copy link
Contributor Author

Name using clz using switch using LUT
short ASCII strings 332 MiB/s [3] 384 MiB/s [3] 358 MiB/s [3]
short Unicode strings 414 MiB/s [3] 601 MiB/s [3] 498 MiB/s [3]
pure ASCII strings 6729 MiB/s [80] 8501 MiB/s [80] 8512 MiB/s [80]
pure Unicode strings 608 MiB/s [80] 691 MiB/s [80] 728 MiB/s [80]
mixed ASCII/Unicode strings 910 MiB/s [224] 973 MiB/s [224] 690 MiB/s [224]
'boxes.txt' 364 MiB/s [6976] 509 MiB/s [6976] 430 MiB/s [6976]
'glasses.txt' 587 MiB/s [10014] 824 MiB/s [10014] 700 MiB/s [10014]
'quickbrown.txt' 730 MiB/s [3285] 1145 MiB/s [3285] 960 MiB/s [3285]
'UTF-8-demo.txt' 363 MiB/s [7622] 470 MiB/s [7622] 454 MiB/s [7622]
LUT code
const utf8_first_byte_lens = [256]u3{
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x1F
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x3F
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x5F
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x7F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x9F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xBF
    0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xDF
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xEF
    4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xFF
};

pub fn utf8ByteSequenceLength(first_byte: u8) !u3 {
    const len = utf8_first_byte_lens[first_byte];
    if (len == 0) return error.Utf8InvalidStartByte;
    return len;
}

@daurnimator daurnimator added the standard library This issue involves writing Zig code for the standard library. label Nov 11, 2020
@LemonBoy
Copy link
Contributor

Cool, you stripped away my name out of this code and lumped everything in a single commit... A true class act.

You couldn't really wait for #6390 to be reviewed?

@andrewrk
Copy link
Member

Cool, you stripped away my name out of this code and lumped everything in a single commit... A true class act.

@data-man can you please explain what happened here?

@andrewrk andrewrk closed this Nov 11, 2020
@data-man
Copy link
Contributor Author

Cool, you stripped away my name out of this code

I referenced your PR:

Extracted from #6390

A true class act.

I'm sorry if it hurt your feelings... :(

@data-man
Copy link
Contributor Author

data-man commented Nov 11, 2020

@LemonBoy

You couldn't really wait for #6390 to be reviewed?

I don't understand why you have not splitted your PR into several parts.

// The switch is optimized much better than a "smart" approach using @clz

This comment is not a good reason. That is why I have given the real results.

@andrewrk

can you please explain what happened here?

I hope I have satisfied your request.

@LemonBoy
Copy link
Contributor

I referenced your PR:

You should've cherry-picked the commits you wanted (or apply the .patch file with git am and remove the extra commits), that's the standard way of picking-up a stalled patchset.

I don't understand why you have not splitted your PR into several parts.

Because it was not reviewed yet?
I also have another PR touching fmt.zig, I was waiting for one of them to land first to rebase the other.

This comment is not a good reason. That is why I have given the real results.

🤔 what?
I wrote that comment after profiling the code and running a few benchmarks, what else did you expect?

Rename utf8ValidCodepoint to isValidCodepoint

I kept the original naming structure, the idea was to introduce some kind of namespacing a-la crypto.zig with const utf8 = struct {...} and const utf16 = struct {...} but was deferred to a later PR.

@data-man
Copy link
Contributor Author

You should've cherry-picked the commits you wanted

git is still a dark forest for me. Sorry if I don't know how to do something.

@data-man data-man deleted the utf8_speedup branch November 11, 2020 08:23
@data-man
Copy link
Contributor Author

lol I'm 48 years old, excuse the old man. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants