Skip to content

std.fmt: Clarify that width is measured in Unicode Codepoints. #18536

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

davideger
Copy link
Contributor

According to the current implementation of formatBuf(), we measure the "number of characters" taken up by a slice given for rendering using unicode.utf8CountCodepoints. Currently the fill character is a single ASCII byte. Hence, "width" today means number of unicode codepoints.

Given more advanced terminals like ghostty it's arguable we might want to count grapheme clusters when providing width and alignment, but then that would bring much heaviness in the form of a library like ziglyph into a very core part of zig, so probably not. Better to just say what we're doing today.

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what it should say. There is no plan for the standard library to support Unicode, and dividing things based on unicode codepoints is worse than leaving things encoded as bytes.

If the implementation disagrees with this, the implementation is wrong.

@@ -42,7 +42,7 @@ pub const FormatOptions = struct {
/// - *specifier* is a type-dependent formatting option that determines how a type should formatted (see below)
/// - *fill* is a single character which is used to pad the formatted text
/// - *alignment* is one of the three characters `<`, `^`, or `>` to make the text left-, center-, or right-aligned, respectively
/// - *width* is the total width of the field in characters
/// - *width* is the total width of the field in "characters" (unicode codepoints)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - *width* is the total width of the field in "characters" (unicode codepoints)
/// - *width* is the total width of the field in bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, a change just went in to allow any unicode codepoint to be used for the fill "character" ( 279607c ) is that wrong too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That one is a bit tricky. It's a counter-intuitive UI but it's technically OK since the implementation does not need to be Unicode-aware to use an arbitrary sequence of bytes as a fill character. It may as well be fill_bytes: []const u8 and the implementation assumes that all those bytes are to be treated as one width unit. However, it's not worth having that field be a reference to external memory, so having it be a fixed size integer is worth the limitation. It's similar rational to Zig's character literals, which are comptime_int and support any single Unicode codepoint, but do not for example support 👨‍👩‍👧‍👦 which is 4 codepoints joined with 3 Zero Width Join codepoints, because the purpose of a character literal is to be an integer.

This kind of unfortunate complexity (the fact that there is not a single integer corresponding to every Unicode character) is one reason I have no intention for Zig to depend on the large amount of volatile data needed to keep up with Unicode.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Points up how the term "character" is too ambiguous -- Unicode itself doesn't define it for good reason. The example of 👨‍👩‍👧‍👦 is what's more technically termed a grapheme cluster (that this looks like a single "character" here is entirely dependent on the font and the display context (web browser).
The zig term "character literal" trips some people up, because it's actually a "Unicode code point" literal. it would be nice to transitions discussion and the docs to use this term, even if it departs from the "C" terminology. Lots of folks wish for a "character cell" model for text formatting, but this always falls apart in the face of combining characters, worldwide text, fonts, and rendering technology. These is well beyond the scope of the standard library. What's most often of concern when writing format-to-buffer is the storage for the data, so stick to bytes for sizes and return values that give you resulting sizes of things. The fill quantity perhaps should not be bytes or characters, but a count of repetitions of the fill codepoint. Even if you have a Unicode character database, that is not sufficient in general for text layout. Counts of Unicode codepoints are in general not useful, and tends to encourage the wrong mental model of worldwide (Unicode) text.

Andrew I think has drawn just the right lines of compromise for fmt functionality.

@Vexu
Copy link
Member

Vexu commented Jan 14, 2024

dividing things based on unicode codepoints is worse than leaving things encoded as bytes.

If the implementation disagrees with this, the implementation is wrong.

#6390 (comment)

@andrewrk
Copy link
Member

andrewrk commented Jan 14, 2024

@Vexu I think that entire PR needs to get reverted. I merged it because I trusted @LemonBoy's stewardship over std.fmt at the time, however, now that std.fmt has fallen back into my hands, I don't like what I see. It was a mistake to deviate from dealing purely in encoded bytes. I've explained this time and time again, and I'm sorry for not putting my foot down also in this instance. I also do not care for the way format() functions are used, particularly, we used to be able to do hex printing with {x} and bytes printing with {B} and {Bi} which was magnificent. Now we have to import poorly named format helper functions and track down where a format() function is implemented, which starts to approach the annoyance of finding a function definition in C++.

Furthermore, a single prefix escape character (%) is simpler, easier to parse, and already has precedent, compared to {}.
Oh yeah, and our existing formatted printing implementation unnecessarily creates a bunch of binary bloat.

It's unfortunate that std.fmt is probably one of the most widely depended on APIs of the standard library, given that it's in my opinion one of the worst pieces of software that Zig has to offer.

It's not one person's fault. Or if it is, it's my fault. It's basically just been a free-for-all with miscellaneous people contributing to it in order to hackily solve their one problem, but lacking a unifying vision. However, formatted printing is one of the most important things a standard library has to offer. It's a shame how little care and attention it has been given, considering its importance.

@andrewrk andrewrk added the docs label Jan 14, 2024
@xdBronch
Copy link
Contributor

Furthermore, a single prefix escape character (%) is simpler, easier to parse, and already has precedent, compared to {}.

i agree with everything except this. imo % makes it much harder to tell what is part of the formatting and whats not since it simply ends when it runs out of valid options rather than a delimiter. i cant think of any reasonable way to differentiate between something that i just want to print after the format argument and an invalid format option. also already has precedent? this seems like very backwards logic in the context of zig. you and others that work on zig are constantly pushing the language forward by purposefully ignoring the things that have precedent.

@andrewrk
Copy link
Member

andrewrk commented Jan 14, 2024

If there's a good reason to depart from status quo, then by all means, the point of Zig is to revisit such things. But there's a reason that Zig uses familiar keywords, familiar syntax, familiar imperative control flow, etc. When the precedents are good, there is a lot of value in being the same because it reduces the language's learning curve.

I honestly think that, apart from requiring half of the implementation to be in the compiler, and the security vulnerabilities, formatted printing is one of the most well-designed, elegant things about the C programming language. It works remarkably well while accomplishing both minimal runtime bloat and satisfactory performance.

@xdBronch
Copy link
Contributor

i just find {} much more intuitive than %. its instantly clear when the formatting begins and ends as well as allowing more advanced options such as named and indexed arguments that the current std.fmt can do. i dont see a way that can be achieved with a C style format string that isnt awful to use. i think the syntax is the least offensive part of zigs formatting

@mlugg
Copy link
Member

mlugg commented Jan 15, 2024

Okay, so I have opinions on formatted printing which deviate quite a lot from Andrew's I believe. This seems as good a place as any, so let me spell them out and justify them.

Firstly, I definitely want to voice my support for {} or some other open-close format signifier. When I first started using Zig, I actually strongly disliked it, but I've swung around hard on this; after I got over initial confusion, I realised just how hard I find C format specifiers to read when there isn't surrounding whitespace. Zig's specifiers, OTOH, are much more flexible: it's obvious at a glance (even if you don't know the specifier format!) where it starts and ends, and specifiers can be arbitrarily long and complex since the entire string can be easily plucked out. I do recognise that C-style specifiers are easier to parse, but they're much harder to read when not followed by whitespace, and are significantly less versatile. I'm not wedded to braces specifically - I wouldn't mind if it were [] or something - but I think having a clear start and end to a specifier is hugely beneficial.

Regarding binary bloat, I can't say for sure, but I'd be very surprised if this couldn't be trivially resolved with some sensible uses of inline. I believe you've previously spoken about validating format strings at comptime but re-parsing for calling the printing logic at runtime: if I'm coreect, then I think that this is, to be blunt, an absolutely awful idea, and one of the weakest things about C printf. If we inline the calls correctly, then a formatted printing call just expands to maybe a half dozen standard function calls. That isn't significant bloat - it's at most 2n+1 function calls for n specifiers. The bad thing about the current system IMO is that every one of these calls generates an actual instantiation - meaning if the optimizer doesn't choose to inline it, we could pay the cost of a function prologue and epilogue, as well as obeying a calling convention on either end (probably involving stuffing everything into a tuple on the stack!). I don't have the numbers to back this up, but that seems like a way bigger concern than the calls to deeper logic actually existing. At least, it doesn't feel like a good reason to allow random metaprogramming-ey strings to leak into the application and add pointless runtime logic to formatted printing when it can be trivially lowered to a small amount of fixed calls. Perhaps actually doing this inlining would prove me wrong, but I would be honestly surprised. I wouldn't mind passing individual specifiers at runtime if the instantions of individual format functions produce bloat - perhaps the format function itself makes the choice on whether the parameter be marked comptime based on whether it expects to be called with a variety of format strings and whether it helps for it to be comptime-known - but I wouldn't expect there to be any benefit of going past that point.

I don't have any strong opinions on {B} etc as specifiers, but I honestly don't understand the distaste for formatter functions. If you're using a debugger, they're at the top of your call stack, just like for a normal specifier. If you're reading code, formatter functions are way easier to trace: you can see in the args tuple exactly what function creates the formatter, so in two steps you get to the formatter function, whereas for built-in specifiers you might have to actually delve into the std.fmt implementation. "Tracking down" where it is defined doesn't make sense to me as a complaint - you just as equally have to "track down" documentation for specifiers. Issues of documentation will persist no matter what is being documented. I wouldn't mind if we introduced specifiers for some semi-common non-trivial printing actions, but I really don't think it's a big usability issue.

On the general form of our specifiers, I have no strong comment, because I don't use non-trivial formatted printing often enough. However, I would like to see better support and documentation for how specifiers are "inherited" through different types (e.g. printing a struct by specifying field formats in one format string). This isn't a big deal, but it's definitely something I've wanted more than once.

One thing I absolutely do agree with: we should deal in bytes, not in anything Unicode. Whilst Unicode and UTF-8 are great creations, and Zig has vague opinionation towards UTF-8 in general, that's more about source encoding etc, where we need to make such a choice. It is not useful for this opinionation to be too viral, and Unicode implementations are complex so in general should be kept away from commonly used APIs (both for reasons of binary bloat and for reasons of speed).

@andrewrk
Copy link
Member

Regarding binary bloat, I can't say for sure, but I'd be very surprised if this couldn't be trivially resolved with some sensible uses of inline.

Regarding this point specifically, is it something that can be explored without breaking changes? Because that sounds like a hugely beneficial enhancement if it is possible simply by making format() inline. The two concerns with this are binary bloat and compilation speed.

@xdBronch
Copy link
Contributor

related #9635
this mentioned the use of inline in certain places helping

@andrewrk andrewrk closed this in 2d9c479 Jan 22, 2024
bilaliscarioth pushed a commit to bilaliscarioth/zig that referenced this pull request Jan 27, 2024
Currently, std.fmt has a misguided, half-assed Unicode implementation
with an ambiguous definition of the word "character". This commit does
almost nothing to mitigate the problem, but it lets me close an open PR.

In the future I will revert 473cb1f as
well as 279607c, and redo the whole
std.fmt API, breaking everyone's code and unfortunately causing nearly
every Zig user to have a bad day. std.fmt will go back to only dealing
in bytes, with zero Unicode awareness whatsoever. I suggest a third
party package provide Unicode functionality as well as a more advanced
text formatting function for when Unicode awareness is needed. I have
always suggested this, and I sincerely apologize for merging pull
requests that compromised my stance on this matter.

Most applications should, instead, strive to make their code independent
of Unicode, dealing strictly in encoded UTF-8 bytes, and never attempt
operations such as: substring manipulation, capitalization, alignment,
word replacement, or column number calculations.

Exceptions to this include web browsers, GUI toolkits, and terminals. If
you're not making one of these, any dependency on Unicode is probably a
bug or worse, a poor design decision.

closes ziglang#18536
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants