std.fmt meets UTF-8 #6390

LemonBoy · 2020-09-21T14:22:45Z

This is a reboot of #5569 and #3970 with more polish on top of it.
You can still print ASCII chars (à la printf's c specifier) with c; use u to print Unicode codepoints.
You can now print UTF-8 encoded strings with the specified width/alignment.

ikskuh · 2020-09-21T14:34:07Z

lib/std/fmt.zig

+        if (@typeInfo(@TypeOf(int_value)).Int.bits <= 21) {
+            return formatUnicodeCodepoint(@as(u21, int_value), options, writer);
+        } else {
+            @compileError("Cannot print integer that is larger than 32 bits as an UTF-8 sequence");


This error message should contain "21 bits", not "32 bits"

oops, I've accidentally copy-pasted the message from the @compileError above

ikskuh · 2020-09-21T14:39:26Z

lib/std/fmt.zig

-        },
+    if (options.width) |min_width| {
+        // In case of error assume the buffer content is ASCII-encoded
+        const width = unicode.utf8CountCodepoints(buf) catch |_| buf.len;


This won't decide the string width! A codepoint is not a single character. Example:
"👩‍👦‍👦" is U+1F469 U+200D U+1F466 U+200D U+1F466, which has 5 codepoints, but only width 1

You can look that up with this tool: https://cryptii.com/pipes/unicode-lookup

> This won't decide the string width!

That's a good approximation of the string width, the same approximation used by other PLs.
Entering the wcwidth territory and dealing with tables needing constant updates or mismatches between the producer (Zig, in this case) and the consumer (the terminal emulator/editor/browser) is definitely not something that I'd rank high on my todo list.

Surely it would be nice if a user could put a table in the root source file and the std.unicode APIs could use it.

I think the best solution is to implement the runtime width specifier (see #1358) and let the user specify the display width. I'm playing with a prototype of this idea and it looks promising.
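
For readers following this thread, the gap between codepoint count and display width can be checked with a small test. This is only a sketch, assuming a recent Zig toolchain (roughly 0.11 or newer) and run with `zig test`; the string is the family emoji from the review comment above, and the test name is made up for illustration.

```zig
const std = @import("std");

test "codepoint count is not display width" {
    // U+1F469 U+200D U+1F466 U+200D U+1F466: three emoji joined by
    // zero-width joiners, usually rendered as a single glyph.
    const family = "\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";

    // 4 + 3 + 4 + 3 + 4 bytes of UTF-8.
    try std.testing.expectEqual(@as(usize, 18), family.len);

    // Counting codepoints reports 5, although a terminal or browser that
    // supports the sequence draws one glyph.
    try std.testing.expectEqual(@as(usize, 5), try std.unicode.utf8CountCodepoints(family));
}
```

So a width computed from the codepoint count over-pads this string, which is the mismatch being discussed; getting the real display width would need the width tables mentioned above.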

data-man · 2020-09-21T16:52:43Z

It would be nice if options.fill would be u21. Unicode has many nice fill symbols.

Rocknest · 2020-09-21T19:15:48Z

lib/std/fmt.zig

+    if (unicode.utf8ValidCodepoint(c)) {
+        var buf: [4]u8 = undefined;
+        // The codepoint is surely valid, hence the use of unreachable
+        const len = std.unicode.utf8Encode(@truncate(u21, c), &buf) catch |err| switch (err) {


c is already u21

data-man · 2020-09-22T11:54:10Z

Bad news.

benchmark.zig

const std = @import("std");
const time = std.time;
const Timer = time.Timer;

const count = 1_000_000;

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();

    var buffer: [2048]u8 = undefined;
    var fixed = std.heap.FixedBufferAllocator.init(buffer[0..]);
    const args = try std.process.argsAlloc(&fixed.allocator);

    var i: usize = 1;
    while (i < args.len) : (i += 1) {
        const arg = args[i];
        try stdout.print("Format '{}'\n", .{arg});
        var timer = try Timer.start();
        const start = timer.lap();

        var j: usize = 0;
        while (j < count) : (j += 1) {
            const a = std.fmt.count("{:=^40}", .{arg});
            const b = std.fmt.count("{:=>40}", .{arg});
            const c = std.fmt.count("{:=<40}", .{arg});
        }

        const end = timer.read();

        const elapsed_s = @intToFloat(f64, end - start) / time.ns_per_s;
        const throughput = @floatToInt(u64, @intToFloat(f64, count) / elapsed_s);
        try stdout.print("Throughput: {}\n", .{throughput});
    }
}

$ ./benchmark 123aaaaaaaaaaaaaaaaaaaaaaaaaaa ddddddddddddddaaaaaaaaaaaaaaaaaa
master:

Format '123aaaaaaaaaaaaaaaaaaaaaaaaaaa'
Throughput: 62075612
Format 'ddddddddddddddaaaaaaaaaaaaaaaaaa'
Throughput: 74289364

This PR:

Format '123aaaaaaaaaaaaaaaaaaaaaaaaaaa'
Throughput: 1887236
Format 'ddddddddddddddaaaaaaaaaaaaaaaaaa'
Throughput: 1845342

FireFox317 · 2020-09-22T12:34:47Z

@data-man Why bad news? This seems to be 30-40 times as fast as the master branch?!

Edit: Nvm, throughput was printed instead of time

VojtechStep · 2020-09-22T12:47:31Z

Not really, it's labeled time, but the value printed is actually the throughput

FireFox317 · 2020-09-22T12:49:01Z

Ahh @VojtechStep, jup you are correct.

LemonBoy · 2020-09-22T13:26:10Z

> Bad news.

It was not unexpected, doing more work requires more time :)
I can get the gap down from 40x to 4x by forgoing the codepoint validation, but that's it.

data-man · 2020-09-22T13:31:48Z

@LemonBoy

> It was not unexpected

Of course, the numbers are just for discussion. :)

LemonBoy · 2020-09-27T14:42:46Z

@data-man, last commit should cut the slowdown by a noticeable amount, especially on mostly (or pure) ASCII strings.
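
The commit referred to here is described in the commit list as introducing "a fast path for ASCII sequences". One way such a fast path can look is sketched below. This is an illustration under assumptions, not necessarily the code merged by this PR: `countCodepointsSketch` is a hypothetical name, and the syntax assumes Zig 0.11 or newer.

```zig
const std = @import("std");

/// Hypothetical helper for illustration; not the code merged by this PR.
fn countCodepointsSketch(s: []const u8) !usize {
    const word_size = @sizeOf(usize);
    // A usize with 0x80 in every byte (0x8080808080808080 on 64-bit targets).
    const high_bits: usize = std.math.maxInt(usize) / 0xff * 0x80;

    var count: usize = 0;
    var i: usize = 0;
    while (i < s.len) {
        // Fast path: a word with no high bit set holds word_size ASCII
        // bytes, i.e. word_size codepoints, and needs no decoding.
        while (i + word_size <= s.len) : (i += word_size) {
            const word: usize = @bitCast(s[i..][0..word_size].*);
            if (word & high_bits != 0) break;
            count += word_size;
        }
        if (i >= s.len) break;

        // Slow path: validate exactly one sequence, then retry the fast path.
        const n = try std.unicode.utf8ByteSequenceLength(s[i]);
        if (i + n > s.len) return error.TruncatedInput;
        if (n > 1) _ = try std.unicode.utf8Decode(s[i .. i + n]);
        i += n;
        count += 1;
    }
    return count;
}

test "sketch agrees with std.unicode.utf8CountCodepoints" {
    const inputs = [_][]const u8{ "", "abcdefghij", "äåéëþüúíóö", "こんにちは" };
    for (inputs) |s| {
        try std.testing.expectEqual(try std.unicode.utf8CountCodepoints(s), try countCodepointsSketch(s));
    }
}
```

The mask test does not depend on byte order, so the word can be loaded in native endianness. Pure-ASCII input never leaves the inner loop except for a short tail, which fits the benchmark strings used earlier in this thread.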

FireFox317 · 2020-09-27T14:50:15Z

lib/std/unicode.zig

@@ -776,7 +795,7 @@ fn testUtf8CountCodepoints() !void {
    testing.expectEqual(@as(usize, 10), try utf8CountCodepoints("abcdefghij"));
    testing.expectEqual(@as(usize, 10), try utf8CountCodepoints("äåéëþüúíóö"));
    testing.expectEqual(@as(usize, 5), try utf8CountCodepoints("こんにちは"));
-    testing.expectError(error.Utf8EncodesSurrogateHalf, utf8CountCodepoints("\xED\xA0\x80"));
+    // testing.expectError(error.Utf8EncodesSurrogateHalf, utf8CountCodepoints("\xED\xA0\x80"));


Commented out code?

My bad, I'll add it back

data-man · 2020-09-27T14:54:37Z

Nice!
But I suggest removing the width calculation once #6411 is merged.

Vexu · 2020-11-19T13:56:47Z

lib/std/fmt.zig

+    if (unicode.utf8ValidCodepoint(c)) {
+        var buf: [4]u8 = undefined;
+        // The codepoint is surely valid, hence the use of unreachable
+        const len = std.unicode.utf8Encode(c, &buf) catch |err| switch (err) {
+            error.Utf8CannotEncodeSurrogateHalf, error.CodepointTooLarge => unreachable,
+        };
+        return formatBuf(buf[0..len], options, writer);
+    }
+
+    // In case of error output the replacement char U+FFFD
+    return formatBuf(&[_]u8{ 0xef, 0xbf, 0xbd }, options, writer);


Why not just output the replacement char if utf8Encode returns an error?

hm? 0xef, 0xbf, 0xbd is the UTF-8 encoded replacement char

I meant that utf8Encode returns an error for invalid input so there should be no need to validate before it?

Oh right, GH won't let me see the lines around this.
Yes, that's better. This check must be a leftover from some error-free utf8Encode alternative I was toying with.
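
The simplification agreed on here, letting utf8Encode report the error and substituting U+FFFD in that case, could look roughly like the sketch below. It is not the code from the follow-up commit: `encodeOrReplacement` is a hypothetical helper, and the syntax assumes Zig 0.11 or newer.

```zig
const std = @import("std");

/// Hypothetical helper: encodes `c` into `buf` and returns the encoded
/// bytes, or the UTF-8 encoding of U+FFFD if `c` cannot be encoded.
fn encodeOrReplacement(c: u21, buf: *[4]u8) []const u8 {
    // utf8Encode already rejects surrogate halves and values above
    // U+10FFFF, so no separate validity check is needed beforehand.
    const len = std.unicode.utf8Encode(c, buf) catch return "\u{FFFD}";
    return buf[0..len];
}

test "invalid codepoints become U+FFFD" {
    var buf: [4]u8 = undefined;
    try std.testing.expectEqualStrings("\xc3\xa9", encodeOrReplacement(0xE9, &buf));
    try std.testing.expectEqualStrings("\xef\xbf\xbd", encodeOrReplacement(0xD800, &buf));
}
```

The replacement bytes are the same 0xef, 0xbf, 0xbd sequence that appears in the diff above.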

ikskuh suggested changes Sep 21, 2020

Rocknest reviewed Sep 21, 2020

andrewrk self-assigned this Sep 21, 2020

FireFox317 reviewed Sep 27, 2020

g-w1 added a commit to g-w1/ezc that referenced this pull request Oct 11, 2020

try unicode but need ziglang/zig#6390 to be merged

cb5a570

LemonBoy force-pushed the reboot-3970 branch from 99f769b to 4b51a20 Compare October 23, 2020 13:34

data-man and others added 9 commits November 5, 2020 16:10

Add 'u' specifier to std.format

678ecc9

Update the API and add add error-recovery path

2cce230

std: Introduce std.unicode.utf8CountCodepoints

6c4efab

std: Introduce std.unicode.utf8ValidCodepoint

44533f1

Clean up the unicode codepoint formatter a bit

675de8d

Make std.formatBuf UTF-8 aware

0316ac9

Fix typo in documentation

1982e0c

Address review comments

3a1f515

std: Make utf8CountCodepoints much faster

53c1624

Make the code easier for the optimizer to work with and introduce a fast path for ASCII sequences. Introduce a benchmark harness to start tracking the performance of ops on utf8.

LemonBoy force-pushed the reboot-3970 branch from 4b51a20 to 53c1624 Compare November 5, 2020 15:10

data-man mentioned this pull request Nov 11, 2020

Speed up utf8 decoding #7068

Closed

Vexu reviewed Nov 19, 2020

Nicer code for the error code path

60638f0

andrewrk merged commit 473cb1f into ziglang:master Nov 19, 2020

andrewrk mentioned this pull request Nov 19, 2020

std.fmt c specifier should print numbers as unicode codepoints #5564

Closed

Vexu mentioned this pull request Jan 14, 2024

std.fmt: Clarify that width is measured in Unicode Codepoints. #18536

Closed
