std.fmt: Clarify that width is measured in Unicode Codepoints. #18536
This is what it should say. There is no plan for the standard library to support Unicode, and dividing things based on unicode codepoints is worse than leaving things encoded as bytes.
If the implementation disagrees with this, the implementation is wrong.
@@ -42,7 +42,7 @@ pub const FormatOptions = struct {
 /// - *specifier* is a type-dependent formatting option that determines how a type should formatted (see below)
 /// - *fill* is a single character which is used to pad the formatted text
 /// - *alignment* is one of the three characters `<`, `^`, or `>` to make the text left-, center-, or right-aligned, respectively
-/// - *width* is the total width of the field in characters
+/// - *width* is the total width of the field in "characters" (unicode codepoints)
-/// - *width* is the total width of the field in "characters" (unicode codepoints)
+/// - *width* is the total width of the field in bytes
So, a change just went in to allow any Unicode codepoint to be used for the fill "character" (279607c). Is that wrong too?
That one is a bit tricky. It's a counter-intuitive UI but it's technically OK since the implementation does not need to be Unicode-aware to use an arbitrary sequence of bytes as a fill character. It may as well be `fill_bytes: []const u8` and the implementation assumes that all those bytes are to be treated as one width unit. However, it's not worth having that field be a reference to external memory, so having it be a fixed size integer is worth the limitation. It's a similar rationale to Zig's character literals, which are `comptime_int` and support any single Unicode codepoint, but do not for example support 👨👩👧👦 which is 4 codepoints joined with 3 Zero Width Joiner codepoints, because the purpose of a character literal is to be an integer.
This kind of unfortunate complexity (the fact that there is not a single integer corresponding to every Unicode character) is one reason I have no intention for Zig to depend on the large amount of volatile data needed to keep up with Unicode.
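To make the character-literal point concrete, here is a minimal sketch, assuming only `std.testing` and `std.unicode.utf8CountCodepoints` from the standard library. The test name and variable names are illustrative. The family emoji is spelled out with `\u{...}` escapes so that the Zero Width Joiners are visible.

```zig
const std = @import("std");

test "a character literal is one codepoint; a ZWJ sequence is several" {
    // A character literal is an integer that holds a single codepoint.
    const man: u21 = '\u{1F468}';
    try std.testing.expectEqual(@as(u21, 0x1F468), man);

    // The family emoji: 4 emoji codepoints joined by 3 Zero Width Joiners (U+200D).
    // It cannot be written as a character literal, only as a string of bytes.
    const family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";

    // 4 codepoints * 4 bytes + 3 joiners * 3 bytes = 25 bytes of UTF-8.
    try std.testing.expectEqual(@as(usize, 25), family.len);

    // 7 codepoints in total, even though many renderers draw one glyph.
    try std.testing.expectEqual(@as(usize, 7), try std.unicode.utf8CountCodepoints(family));
}
```

None of the three numbers one could report here (25 bytes, 7 codepoints, 1 rendered glyph) can be derived from the others without Unicode data, which is the complexity the comment above refers to.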
This points up how the term "character" is too ambiguous -- Unicode itself doesn't define it for good reason. The example of 👨👩👧👦 is what's more technically termed a grapheme cluster (that this looks like a single "character" here is entirely dependent on the font and the display context, i.e. the web browser).
The Zig term "character literal" trips some people up, because it's actually a "Unicode code point" literal. It would be nice to transition discussion and the docs to use this term, even if it departs from the "C" terminology. Lots of folks wish for a "character cell" model for text formatting, but this always falls apart in the face of combining characters, worldwide text, fonts, and rendering technology. These are well beyond the scope of the standard library. What's most often of concern when writing format-to-buffer is the storage for the data, so stick to bytes for sizes and return values that give you resulting sizes of things. The fill quantity perhaps should not be bytes or characters, but a count of repetitions of the fill codepoint. Even if you have a Unicode character database, that is not sufficient in general for text layout. Counts of Unicode codepoints are in general not useful, and tend to encourage the wrong mental model of worldwide (Unicode) text.
Andrew, I think, has drawn just the right lines of compromise for fmt functionality.
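The "count of repetitions" idea can be sketched as follows. `padLeft` is a hypothetical helper written for this illustration, not part of `std.fmt`. It treats the fill as an opaque byte sequence, takes the padding as a repetition count, and returns the written slice so the caller learns the resulting size in bytes.

```zig
const std = @import("std");

/// Hypothetical helper (not a std.fmt API): writes `fill` repeated
/// `repeats` times, then `text`, into `out`. No Unicode awareness is
/// needed; `fill` is just bytes. Returns the slice that was written.
fn padLeft(out: []u8, text: []const u8, fill: []const u8, repeats: usize) error{NoSpaceLeft}![]u8 {
    const total = fill.len * repeats + text.len;
    if (total > out.len) return error.NoSpaceLeft;

    var pos: usize = 0;
    var r: usize = 0;
    while (r < repeats) : (r += 1) {
        var i: usize = 0;
        while (i < fill.len) : (i += 1) {
            out[pos] = fill[i];
            pos += 1;
        }
    }
    var j: usize = 0;
    while (j < text.len) : (j += 1) {
        out[pos] = text[j];
        pos += 1;
    }
    return out[0..pos];
}

test "fill is counted in repetitions, result size in bytes" {
    var buf: [32]u8 = undefined;
    // U+00B7 (middle dot) is 2 bytes in UTF-8.
    const got = try padLeft(&buf, "42", "\u{00B7}", 3);
    try std.testing.expectEqualStrings("\u{00B7}\u{00B7}\u{00B7}42", got);
    try std.testing.expectEqual(@as(usize, 8), got.len); // 3 * 2 + 2 bytes
}
```

The caller decides how many repetitions are appropriate for its display context; the helper only needs to know how many bytes to copy.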
@Vexu I think that entire PR needs to get reverted. I merged it because I trusted @LemonBoy's stewardship over std.fmt at the time, however, now that std.fmt has fallen back into my hands, I don't like what I see. It was a mistake to deviate from dealing purely in encoded bytes. I've explained this time and time again, and I'm sorry for not putting my foot down also in this instance.
I also do not care for the way format() functions are used, particularly, we used to be able to do hex printing with {x} and bytes printing with {B} and {Bi} which was magnificent. Now we have to import poorly named format helper functions and track down where a format() function is implemented, which starts to approach the annoyance of finding a function definition in C++.
Furthermore, a single prefix escape character ( [...] It's unfortunate that [...]
It's not one person's fault. Or if it is, it's my fault. It's basically just been a free-for-all with miscellaneous people contributing to it in order to hackily solve their one problem, but lacking a unifying vision. However, formatted printing is one of the most important things a standard library has to offer. It's a shame how little care and attention it has been given, considering its importance.
i agree with everything except this. imo [...]
If there's a good reason to depart from status quo, then by all means, the point of Zig is to revisit such things. But there's a reason that Zig uses familiar keywords, familiar syntax, familiar imperative control flow, etc. When the precedents are good, there is a lot of value in being the same because it reduces the language's learning curve. I honestly think that, apart from requiring half of the implementation to be in the compiler, and the security vulnerabilities, formatted printing is one of the most well-designed, elegant things about the C programming language. It works remarkably well while accomplishing both minimal runtime bloat and satisfactory performance.
i just find [...]
Okay, so I have opinions on formatted printing which deviate quite a lot from Andrew's I believe. This seems as good a place as any, so let me spell them out and justify them.
Firstly, I definitely want to voice my support for [...]
Regarding binary bloat, I can't say for sure, but I'd be very surprised if this couldn't be trivially resolved with some sensible uses of [...]
I don't have any strong opinions on [...]
On the general form of our specifiers, I have no strong comment, because I don't use non-trivial formatted printing often enough. However, I would like to see better support and documentation for how specifiers are "inherited" through different types (e.g. printing a struct by specifying field formats in one format string). This isn't a big deal, but it's definitely something I've wanted more than once.
One thing I absolutely do agree with: we should deal in bytes, not in anything Unicode. Whilst Unicode and UTF-8 are great creations, and Zig has vague opinionation towards UTF-8 in general, that's more about source encoding etc, where we need to make such a choice. It is not useful for this opinionation to be too viral, and Unicode implementations are complex so in general should be kept away from commonly used APIs (both for reasons of binary bloat and for reasons of speed).
Regarding this point specifically, is it something that can be explored without breaking changes? Because that sounds like a hugely beneficial enhancement if it is possible simply by making format() inline. The two concerns with this are binary bloat and compilation speed.
related #9635
Currently, std.fmt has a misguided, half-assed Unicode implementation with an ambiguous definition of the word "character". This commit does almost nothing to mitigate the problem, but it lets me close an open PR.
In the future I will revert 473cb1f as well as 279607c, and redo the whole std.fmt API, breaking everyone's code and unfortunately causing nearly every Zig user to have a bad day. std.fmt will go back to only dealing in bytes, with zero Unicode awareness whatsoever. I suggest a third party package provide Unicode functionality as well as a more advanced text formatting function for when Unicode awareness is needed. I have always suggested this, and I sincerely apologize for merging pull requests that compromised my stance on this matter.
Most applications should, instead, strive to make their code independent of Unicode, dealing strictly in encoded UTF-8 bytes, and never attempt operations such as: substring manipulation, capitalization, alignment, word replacement, or column number calculations. Exceptions to this include web browsers, GUI toolkits, and terminals. If you're not making one of these, any dependency on Unicode is probably a bug or worse, a poor design decision.
closes ziglang#18536
According to the current implementation of `formatBuf()`, we measure the "number of characters" taken up by a slice given for rendering using `unicode.utf8CountCodepoints`. Currently the fill character is a single ASCII byte. Hence, "width" today means number of unicode codepoints.
Given more advanced terminals like ghostty it's arguable we might want to count grapheme clusters when providing width and alignment, but then that would bring much heaviness in the form of a library like ziglyph into a very core part of zig, so probably not. Better to just say what we're doing today.