Skip to content

std.fmt meets UTF-8 #6390

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Nov 19, 2020
Merged

std.fmt meets UTF-8 #6390

merged 10 commits into from
Nov 19, 2020

Conversation

LemonBoy
Copy link
Contributor

This is a reboot of #5569 and #3970 with more polish on top of it.
You can still print ASCII chars (a-la printf c specifier) with c, use u for printing unicode codepoints.
You can now print UTF-8 encoded strings with the specified width/alignment.

lib/std/fmt.zig Outdated
if (@typeInfo(@TypeOf(int_value)).Int.bits <= 21) {
return formatUnicodeCodepoint(@as(u21, int_value), options, writer);
} else {
@compileError("Cannot print integer that is larger than 32 bits as an UTF-8 sequence");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message should contain "21 bits", not "32 bits"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I've accidentally copy-pasted the message from the @compileError above

},
if (options.width) |min_width| {
// In case of error assume the buffer content is ASCII-encoded
const width = unicode.utf8CountCodepoints(buf) catch |_| buf.len;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't decide the string width! A codepoint is not a single character. Example:
"👩‍👦‍👦" is U+1F469 U+200D U+1F466 U+200D U+1F466, which has 5 codepoints, but only width 1

You can look that up with this tool: https://cryptii.com/pipes/unicode-lookup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't decide the string width!

That's a good approximation of the string width, the same approximation used by other PLs.
Entering the wcwidth territory and dealing with tables needing constant updates or mismatches between the producer (Zig, in this case) and the consumer (the terminal emulator/editor/browser) is definitely not something that I'd rank high on my todo list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely it would be nice if a user could put a table in the root source file and std.unicode apis could use it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best solution is to implement the runtime width specifier (see #1358) and let the user specify the display width, I'm playing with a prototype of this idea and it looks promising.

@data-man
Copy link
Contributor

It would be nice if options.fill would be u21. Unicode has many nice fill symbols.

lib/std/fmt.zig Outdated
if (unicode.utf8ValidCodepoint(c)) {
var buf: [4]u8 = undefined;
// The codepoint is surely valid, hence the use of unreachable
const len = std.unicode.utf8Encode(@truncate(u21, c), &buf) catch |err| switch (err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c is already u21

@andrewrk andrewrk self-assigned this Sep 21, 2020
@data-man
Copy link
Contributor

data-man commented Sep 22, 2020

Bad news.

benchmark.zig
const std = @import("std");
const time = std.time;
const Timer = time.Timer;

const count = 1_000_000;

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();

    var buffer: [2048]u8 = undefined;
    var fixed = std.heap.FixedBufferAllocator.init(buffer[0..]);
    const args = try std.process.argsAlloc(&fixed.allocator);

    var i: usize = 1;
    while (i < args.len) : (i += 1) {
        const arg = args[i];
        try stdout.print("Format '{}'\n", .{arg});
        var timer = try Timer.start();
        const start = timer.lap();

        var j: usize = 0;
        while (j < count) : (j += 1) {
            const a = std.fmt.count("{:=^40}", .{arg});
            const b = std.fmt.count("{:=>40}", .{arg});
            const c = std.fmt.count("{:=<40}", .{arg});
        }

        const end = timer.read();

        const elapsed_s = @intToFloat(f64, end - start) / time.ns_per_s;
        const throughput = @floatToInt(u64, @intToFloat(f64, count) / elapsed_s);
        try stdout.print("Throughput: {}\n", .{throughput});
    }
}

$ ./benchmark 123aaaaaaaaaaaaaaaaaaaaaaaaaaa ddddddddddddddaaaaaaaaaaaaaaaaaa
master:

Format '123aaaaaaaaaaaaaaaaaaaaaaaaaaa'
Throughput: 62075612
Format 'ddddddddddddddaaaaaaaaaaaaaaaaaa'
Throughput: 74289364

This PR:

Format '123aaaaaaaaaaaaaaaaaaaaaaaaaaa'
Throughput: 1887236
Format 'ddddddddddddddaaaaaaaaaaaaaaaaaa'
Throughput: 1845342```

@FireFox317
Copy link
Contributor

FireFox317 commented Sep 22, 2020

@data-man Why bad news? This seems to be 30-40 times as fast as the master branch ?!

Edit: Nvm, throughput was printed instead of time

@VojtechStep
Copy link

Not really, it's labeled time, but the value printed is actually the throughput

@FireFox317
Copy link
Contributor

Ahh @VojtechStep, jup you are correct.

@LemonBoy
Copy link
Contributor Author

Bad news.

It was not unexpected, doing more work requires more time :)
I can get the gap down from 40x to 4x by foregoing the codepoint validation, but that's it.

@data-man
Copy link
Contributor

@LemonBoy

It was not unexpected

Of course, the numbers are just for discussion. :)

@LemonBoy
Copy link
Contributor Author

@data-man, last commit should cut the slowdown by a noticeable amount, especially on mostly (or pure) ASCII strings.

@@ -776,7 +795,7 @@ fn testUtf8CountCodepoints() !void {
testing.expectEqual(@as(usize, 10), try utf8CountCodepoints("abcdefghij"));
testing.expectEqual(@as(usize, 10), try utf8CountCodepoints("äåéëþüúíóö"));
testing.expectEqual(@as(usize, 5), try utf8CountCodepoints("こんにちは"));
testing.expectError(error.Utf8EncodesSurrogateHalf, utf8CountCodepoints("\xED\xA0\x80"));
// testing.expectError(error.Utf8EncodesSurrogateHalf, utf8CountCodepoints("\xED\xA0\x80"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I'll add it back

@data-man
Copy link
Contributor

Nice!
But I suggest to remove width calculation when #6411 will merged.

g-w1 added a commit to g-w1/ezc that referenced this pull request Oct 11, 2020
lib/std/fmt.zig Outdated
Comment on lines 657 to 667
if (unicode.utf8ValidCodepoint(c)) {
var buf: [4]u8 = undefined;
// The codepoint is surely valid, hence the use of unreachable
const len = std.unicode.utf8Encode(c, &buf) catch |err| switch (err) {
error.Utf8CannotEncodeSurrogateHalf, error.CodepointTooLarge => unreachable,
};
return formatBuf(buf[0..len], options, writer);
}

// In case of error output the replacement char U+FFFD
return formatBuf(&[_]u8{ 0xef, 0xbf, 0xbd }, options, writer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just output the replacement char if utf8Encode returns an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm? 0xef, 0xbf, 0xbd is the UTF-8 encoded replacement char

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that utf8Encode returns an error for invalid input so there should be no need to validate before it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, GH won't let me see the lines around this.
Yes that's better, this check must be a leftover for some error-free utf8Encode alternative I was toying with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants