Speed up utf8 decoding #7068

data-man · 2020-11-11T03:44:56Z

Extracted from #6390 and added isValidCodepoint and utf8CountCodepoints.
Added unicode/benchmark.zig.
Results for me (x86_64):

Name	clz ReleaseFast	clz ReleaseSmall	switch ReleaseFast	switch ReleaseSmall
short ASCII strings	332 MiB/s [3]	274 MiB/s [3]	384 MiB/s [3]	271 MiB/s [3]
short Unicode strings	414 MiB/s [3]	523 MiB/s [3]	601 MiB/s [3]	488 MiB/s [3]
pure ASCII strings	6729 MiB/s [80]	8990 MiB/s [80]	8501 MiB/s [80]	7485 MiB/s [80]
pure Unicode strings	608 MiB/s [80]	648 MiB/s [80]	691 MiB/s [80]	612 MiB/s [80]
mixed ASCII/Unicode strings	910 MiB/s [224]	1026 MiB/s [224]	973 MiB/s [224]	1094 MiB/s [224]
'boxes.txt'	364 MiB/s [6976]	383 MiB/s [6976]	509 MiB/s [6976]	506 MiB/s [6976]
'glasses.txt'	587 MiB/s [10014]	645 MiB/s [10014]	824 MiB/s [10014]	789 MiB/s [10014]
'quickbrown.txt'	730 MiB/s [3285]	843 MiB/s [3285]	1145 MiB/s [3285]	1086 MiB/s [3285]
'UTF-8-demo.txt'	363 MiB/s [7622]	385 MiB/s [7622]	470 MiB/s [7622]	470 MiB/s [7622]

data-man · 2020-11-11T04:06:17Z

Name	using clz	using switch	using LUT
short ASCII strings	332 MiB/s [3]	384 MiB/s [3]	358 MiB/s [3]
short Unicode strings	414 MiB/s [3]	601 MiB/s [3]	498 MiB/s [3]
pure ASCII strings	6729 MiB/s [80]	8501 MiB/s [80]	8512 MiB/s [80]
pure Unicode strings	608 MiB/s [80]	691 MiB/s [80]	728 MiB/s [80]
mixed ASCII/Unicode strings	910 MiB/s [224]	973 MiB/s [224]	690 MiB/s [224]
'boxes.txt'	364 MiB/s [6976]	509 MiB/s [6976]	430 MiB/s [6976]
'glasses.txt'	587 MiB/s [10014]	824 MiB/s [10014]	700 MiB/s [10014]
'quickbrown.txt'	730 MiB/s [3285]	1145 MiB/s [3285]	960 MiB/s [3285]
'UTF-8-demo.txt'	363 MiB/s [7622]	470 MiB/s [7622]	454 MiB/s [7622]

LUT code

const utf8_first_byte_lens = [256]u3{
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x1F
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x3F
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x5F
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x7F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x9F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xBF
    0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xDF
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xEF
    4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xFF
};

pub fn utf8ByteSequenceLength(first_byte: u8) !u3 {
    const len = utf8_first_byte_lens[first_byte];
    if (len == 0) return error.Utf8InvalidStartByte;
    return len;
}

lib/std/unicode.zig

LemonBoy · 2020-11-11T07:17:46Z

Cool, you stripped away my name out of this code and lumped everything in a single commit... A true class act.

You couldn't really wait for #6390 to be reviewed?

andrewrk · 2020-11-11T07:25:05Z

Cool, you stripped away my name out of this code and lumped everything in a single commit... A true class act.

@data-man can you please explain what happened here?

data-man · 2020-11-11T07:31:35Z

Cool, you stripped away my name out of this code

I referenced your PR:

Extracted from #6390

A true class act.

I'm sorry if it hurt your feelings... :(

data-man · 2020-11-11T07:58:47Z

@LemonBoy

You couldn't really wait for #6390 to be reviewed?

I don't understand why you have not splitted your PR into several parts.

// The switch is optimized much better than a "smart" approach using @clz

This comment is not a good reason. That is why I have given the real results.

@andrewrk

can you please explain what happened here?

I hope I have satisfied your request.

LemonBoy · 2020-11-11T08:09:56Z

I referenced your PR:

You should've cherry-picked the commits you wanted (or apply the .patch file with git am and remove the extra commits), that's the standard way of picking-up a stalled patchset.

I don't understand why you have not splitted your PR into several parts.

Because it was not reviewed yet?
I also have another PR touching fmt.zig, I was waiting for one of them to land first to rebase the other.

This comment is not a good reason. That is why I have given the real results.

🤔 what?
I wrote that comment after profiling the code and running a few benchmarks, what else did you expect?

Rename utf8ValidCodepoint to isValidCodepoint

I kept the original naming structure, the idea was to introduce some kind of namespacing a-la crypto.zig with const utf8 = struct {...} and const utf16 = struct {...} but was deferred to a later PR.

data-man · 2020-11-11T08:23:13Z

You should've cherry-picked the commits you wanted

git is still a dark forest for me. Sorry if I don't know how to do something.

data-man · 2020-11-11T09:31:47Z

lol I'm 48 years old, excuse the old man. :)

Speed up utf8 decoding

9095089

daurnimator added the standard library This issue involves writing Zig code for the standard library. label Nov 11, 2020

daurnimator reviewed Nov 11, 2020

View reviewed changes

lib/std/unicode.zig Show resolved Hide resolved

xackus suggested changes Nov 11, 2020

View reviewed changes

lib/std/unicode.zig Outdated Show resolved Hide resolved

lib/std/unicode.zig Outdated Show resolved Hide resolved

Rename utf8ValidCodepoint to isValidCodepoint

8b76740

andrewrk closed this Nov 11, 2020

data-man deleted the utf8_speedup branch November 11, 2020 08:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up utf8 decoding #7068

Speed up utf8 decoding #7068

data-man commented Nov 11, 2020 •

edited

Loading

data-man commented Nov 11, 2020

LemonBoy commented Nov 11, 2020

andrewrk commented Nov 11, 2020

data-man commented Nov 11, 2020

data-man commented Nov 11, 2020 •

edited

Loading

LemonBoy commented Nov 11, 2020

data-man commented Nov 11, 2020

data-man commented Nov 11, 2020

Speed up utf8 decoding #7068

Speed up utf8 decoding #7068

Conversation

data-man commented Nov 11, 2020 • edited Loading

data-man commented Nov 11, 2020

LemonBoy commented Nov 11, 2020

andrewrk commented Nov 11, 2020

data-man commented Nov 11, 2020

data-man commented Nov 11, 2020 • edited Loading

LemonBoy commented Nov 11, 2020

data-man commented Nov 11, 2020

data-man commented Nov 11, 2020

data-man commented Nov 11, 2020 •

edited

Loading

data-man commented Nov 11, 2020 •

edited

Loading