-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
std.fmt meets UTF-8 #6390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std.fmt meets UTF-8 #6390
Conversation
lib/std/fmt.zig
Outdated
if (@typeInfo(@TypeOf(int_value)).Int.bits <= 21) { | ||
return formatUnicodeCodepoint(@as(u21, int_value), options, writer); | ||
} else { | ||
@compileError("Cannot print integer that is larger than 32 bits as an UTF-8 sequence"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message should contain "21 bits", not "32 bits"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I've accidentally copy-pasted the message from the @compileError
above
}, | ||
if (options.width) |min_width| { | ||
// In case of error assume the buffer content is ASCII-encoded | ||
const width = unicode.utf8CountCodepoints(buf) catch |_| buf.len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't decide the string width! A codepoint is not a single character. Example:
"👩👦👦"
is U+1F469 U+200D U+1F466 U+200D U+1F466, which has 5 codepoints, but only width 1
You can look that up with this tool: https://cryptii.com/pipes/unicode-lookup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't decide the string width!
That's a good approximation of the string width, the same approximation used by other PLs.
Entering the wcwidth
territory and dealing with tables needing constant updates or mismatches between the producer (Zig, in this case) and the consumer (the terminal emulator/editor/browser) is definitely not something that I'd rank high on my todo list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely it would be nice if a user could put a table in the root source file and std.unicode apis could use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best solution is to implement the runtime width specifier (see #1358) and let the user specify the display width, I'm playing with a prototype of this idea and it looks promising.
It would be nice if |
lib/std/fmt.zig
Outdated
if (unicode.utf8ValidCodepoint(c)) { | ||
var buf: [4]u8 = undefined; | ||
// The codepoint is surely valid, hence the use of unreachable | ||
const len = std.unicode.utf8Encode(@truncate(u21, c), &buf) catch |err| switch (err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c is already u21
Bad news. benchmark.zigconst std = @import("std");
const time = std.time;
const Timer = time.Timer;
const count = 1_000_000;
pub fn main() !void {
const stdout = std.io.getStdOut().writer();
var buffer: [2048]u8 = undefined;
var fixed = std.heap.FixedBufferAllocator.init(buffer[0..]);
const args = try std.process.argsAlloc(&fixed.allocator);
var i: usize = 1;
while (i < args.len) : (i += 1) {
const arg = args[i];
try stdout.print("Format '{}'\n", .{arg});
var timer = try Timer.start();
const start = timer.lap();
var j: usize = 0;
while (j < count) : (j += 1) {
const a = std.fmt.count("{:=^40}", .{arg});
const b = std.fmt.count("{:=>40}", .{arg});
const c = std.fmt.count("{:=<40}", .{arg});
}
const end = timer.read();
const elapsed_s = @intToFloat(f64, end - start) / time.ns_per_s;
const throughput = @floatToInt(u64, @intToFloat(f64, count) / elapsed_s);
try stdout.print("Throughput: {}\n", .{throughput});
}
}
This PR:
|
@data-man Why bad news? This seems to be 30-40 times as fast as the master branch ?! Edit: Nvm, throughput was printed instead of time |
Not really, it's labeled time, but the value printed is actually the throughput |
Ahh @VojtechStep, jup you are correct. |
It was not unexpected, doing more work requires more time :) |
Of course, the numbers are just for discussion. :) |
@data-man, last commit should cut the slowdown by a noticeable amount, especially on mostly (or pure) ASCII strings. |
@@ -776,7 +795,7 @@ fn testUtf8CountCodepoints() !void { | |||
testing.expectEqual(@as(usize, 10), try utf8CountCodepoints("abcdefghij")); | |||
testing.expectEqual(@as(usize, 10), try utf8CountCodepoints("äåéëþüúíóö")); | |||
testing.expectEqual(@as(usize, 5), try utf8CountCodepoints("こんにちは")); | |||
testing.expectError(error.Utf8EncodesSurrogateHalf, utf8CountCodepoints("\xED\xA0\x80")); | |||
// testing.expectError(error.Utf8EncodesSurrogateHalf, utf8CountCodepoints("\xED\xA0\x80")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented out code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I'll add it back
Nice! |
Make the code easier for the optimizer to work with and introduce a fast path for ASCII sequences. Introduce a benchmark harness to start tracking the performance of ops on utf8.
lib/std/fmt.zig
Outdated
if (unicode.utf8ValidCodepoint(c)) { | ||
var buf: [4]u8 = undefined; | ||
// The codepoint is surely valid, hence the use of unreachable | ||
const len = std.unicode.utf8Encode(c, &buf) catch |err| switch (err) { | ||
error.Utf8CannotEncodeSurrogateHalf, error.CodepointTooLarge => unreachable, | ||
}; | ||
return formatBuf(buf[0..len], options, writer); | ||
} | ||
|
||
// In case of error output the replacement char U+FFFD | ||
return formatBuf(&[_]u8{ 0xef, 0xbf, 0xbd }, options, writer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just output the replacement char if utf8Encode
returns an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm? 0xef, 0xbf, 0xbd
is the UTF-8 encoded replacement char
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that utf8Encode
returns an error for invalid input so there should be no need to validate before it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, GH won't let me see the lines around this.
Yes that's better, this check must be a leftover for some error-free utf8Encode
alternative I was toying with.
This is a reboot of #5569 and #3970 with more polish on top of it.
You can still print ASCII chars (a-la
printf
c
specifier) withc
, useu
for printing unicode codepoints.You can now print UTF-8 encoded strings with the specified width/alignment.