std.fmt: Improve numeric options, simplify custom formatters, reduce complexity and more #20152
Comments
Something to consider: If alignment is removed from numeric formatters, how does one achieve decimal alignment?
Decimal alignment is common in monetary applications.
This is a good proposal. The right side of the
Rather than special-casing a few prefixes, a general solution would be to allow one level of nesting. So for e.g. a right-aligned hex column of width sixteen, It's better to keep
It doesn't make sense for a UTF-8 formatting library to not be UTF-8 aware. It's producing a UTF-8 string.
Constrained embedded targets are the ones that should be using a powerful embedded-aware formatting library. Limiting the Zig standard library to what can fit on an ESP32 is Procrustean. Code in the standard library should be as small as it can be while providing its function, but no smaller. Embedded targets have always called for distinct libraries which fit their resource-constrained nature.
While this is true,
Now I'm wondering if fmt couldn't somehow be made generic over the character type, with options for UTF-16 or ASCII etc., as long as each type can provide some form of character iteration to find the format specifiers etc. I definitely agree that the default UTF-8 fmt should allow a UTF-8 fill character though.
I think it's most reasonable for Zig

Sentiment is very clear that plain text data should be UTF-8, that other encodings should be treated as legacy, and that using them as a transfer format, or for data at rest, should be deprecated. HTML illustrates that clearly. I'd much rather have more and better UTF-8 features than something agnostic about encoding, which duplicates everything and gives a chance for bugs in each separate implementation (which must then be tested and maintained). Converting where necessary is a simple algorithm, and a solved problem.

I firmly agree that, all else equal, tighter codegen is better than the alternative. But I don't think that "no features in this proposal can lead to larger object code" is a good precept. Features should be considered on a bang-for-buck basis; making them optimal is a separate concern. The heavy use of comptime should mean that function specializations don't pay for features they don't use; where that's not true, it's something to work on for sure.
I missed this the first time through, and wanted to say that using a runtime width is not an especially obscure feature. Calculating the maximum width of a column, and aligning every printed value to that width, is pretty basic stuff. I wouldn't expect precision to get as much use as width, but it's imaginable for e.g. significant figures to be determined at runtime. I've done that once, actually, but have lost count of the number of times I've calculated a column width on the basis of data. I see no advantage to having runtime fill or alignment though; those aren't really decisions which are plausibly made on the basis of data.
Replying to some comments in bulk:
I don't think that issue is related as it concerns
For your example you would use
you will probably need to bring your own formatter.
Nested placeholders might be interesting to explore (though personally I think this is another use case better handled by custom formatters instead of being built-in into the format placeholder syntax) but how do you envision this solving the problem with signed prefixed numbers, where the sign should go before the prefix? If I might write up a sub-proposal for formatting options for the built-in numeric types later, but I believe something fairly simple like
I encourage you to read the commit message for 2d9c479. I don't personally think it would be some sort of great sin if And just to clarify, as for the
I'm not suggesting removing runtime options; rather, I'm suggesting removing runtime options from the format placeholder syntax itself, to keep the syntax and core

As an aside, if something like

```zig
std.log.debug("{}, your new score is {}!", .{
    std.fmt.fmtString(name),
    std.fmt.fmtNumber(score, .{ .base = .decimal, .precision = 3 }),
});
```

just to format numbers and strings.
Something like this should work, although it doesn't:

```zig
test "optional signed hex" {
    const d = -5;
    const sign = if (@abs(d) == d) null else '-';
    std.debug.print("optionally-signed hex: {?u}0x{x}\n", .{ sign, @abs(d) });
}
```

You can't provide a type specifier after

There's a semantic wrinkle here, because I also think that signed

Not that it matters if we can provide a general mechanism which supports it.
I firmly disagree. This kind of chauvinism has no place in a modern language. You're only able to hold this opinion because a) you're a monoglot English speaker and b) have decided that you don't care about the needs of the vast majority of people on the planet.
You want to support negative My position is simple: Zig has Unicode literals, like I have in fact read the comments in #18536, including this part:
Which is what you want to do to anyone who doesn't share your fixation on the printable one byte UTF-8 characters! It's an absurdity. As an aside, I do agree that measuring width in codepoints is a mistake, for a number of reasons. But that's a bit off topic. Read this part more carefully, while you're at it:
Encoded bytes, not the small handful of raw bytes which happen to be mostly adequate for expressing your language and your language alone!

I happen to agree that there's room for improvement in how Zig handles literal Unicode characters, but that's off topic for this proposal. I'm actively working on one which will cover it, but I want that to be supported with an implementation and some benchmarks, so it isn't ready yet.

This is the general problem with opening a poorly focused issue like this. It implies that every one of your ideas has to be implemented, or none of them. The numeric options idea is a good one. Your crusade against Unicode is deeply misguided. I think you should remove the entire Unicode question from this issue, and focus on alignment and improving numeric formatting. If you want to make a case for removing
@mnemnion I appreciate that you are participating in the discussion, offering differing views and raising good points. I don't appreciate that you are implying that my point of view is misguided by suggesting that I lack technical experience with character encodings or by falsely stating that I'm a monolingual anglophone, and such comments are greatly diminishing my willingness to engage with your replies.

English is not even my native language. I read and write multiple different languages at varying levels of proficiency, some of which don't even use the Latin alphabet at all. For as long as I can remember I've regularly run into limitations (or worse, bugs) in systems because my full legal name, which is sometimes used as the basis for usernames, contains non-ASCII characters.

Despite having experienced such problems first-hand, I still personally lean on the side of the belief that the core of
I think Unicode support should be built in: Grapheme Cluster is the proper unit of working with strings for anyone who wants to do anything with a UI, e.g.:
Example:

```zig
const std = @import("std");

pub fn main() !void {
    const examples = [_][]const u8{
        "abcde",
        "\u{006E}\u{0303}",
        "\u{0001F3F3}\u{FE0F}\u{200D}\u{0001F308}",
        "ห์",
        "ปีเตอร์",
        "fghij",
        "klmno",
    };
    const stdout = std.io.getStdOut().writer();
    try stdout.print("0123456789\n", .{});
    for (examples) |example| {
        try stdout.print("{s: >10}\n", .{example});
    }
}
```
Expected Output:
Actual Output:
While I do agree that not adding Unicode support makes

To produce the expected output the code must be modified like this:

```zig
const std = @import("std");
const grapheme = @import("grapheme");

pub fn main() !void {
    const examples = [_][]const u8{
        "abcde",
        "\u{006E}\u{0303}",
        "\u{0001F3F3}\u{FE0F}\u{200D}\u{0001F308}",
        "ห์",
        "ปีเตอร์",
        "fghij",
        "klmno",
    };
    const stdout = std.io.getStdOut().writer();
    try stdout.print("0123456789\n", .{});
    for (examples) |example| {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        const allocator = gpa.allocator();
        const gd = try grapheme.GraphemeData.init(allocator);
        defer gd.deinit();
        var iter = grapheme.Iterator.init(example, &gd);
        var len: usize = 0;
        while (iter.next()) |_| : (len += 1) {}
        //try stdout.print("len: {}\n", .{len});
        const num_padding = 10;
        if (len < num_padding) {
            const padding = num_padding - len;
            //try stdout.print("pad: {}\n ", .{padding});
            var i: usize = 0;
            while (i < padding) : (i += 1) {
                try stdout.print(" ", .{});
            }
        }
        try stdout.print("{s}\n", .{example});
    }
}
```

Output:
Basically, everything is Unicode. I don't understand why you would not give people the power to process Unicode strings in a way where they can easily work with grapheme clusters (units of display width one). I strongly believe working with Unicode should be as simple as possible, because otherwise people just won't (properly) support it in their apps: they won't bother dealing with its complexity. So yes, I agree that complexity should be reduced, but for the people working with these things, not those providing them.
Grapheme clusters change with each release of Unicode, and are useless for practical purposes.
Not all characters are the same width; and not all fonts pick the same width.
In addition to the above reasons, you have no way to know this with a non-fixed width font.
This is unrelated.
I have had some ideas for `std.fmt` for a while now but I've been having trouble figuring out how to present them as concrete proposals. The following is an attempt at summarizing a few of them:

Summary of current problems

- … `std` fail at this task).
- … `std.fmt`.
- `std.fmt` should deal in raw bytes only but some parts of it currently deal in Unicode scalar values/UTF-8 sequences.

Summary of proposed solutions

- … `d` or `e`.
- … `0x` prefixes (among others), to make it easier to format numbers correctly for tasks such as pretty-printing or code generation.
- … `options: std.fmt.FormatOptions` parameter from custom `format` formatter functions and instead process alignment generically, separately from formatting through clever use of `std.io.countingWriter`, to make it much easier for users to correctly implement custom formatters.
- … `u` specifier (better handled by a formatter from the `std.unicode` namespace), clarify that the `s` and `c` specifiers output bytes verbatim, redefine `fill` to be a literal byte and redefine `width` to be in bytes.

In more detail
Numeric formatting options are too limited

(Related: #14436, #19488)

Currently, the only available numeric formatting option is `precision`, which controls the number of digits in the fractional part of a floating-point number. We can quickly think of a few other properties a user might want to control when formatting a number:

- The minimum number of digits the number should be zero-padded to.

  There is currently no way to correctly pad a number with leading zeroes such that the sign is written in the correct place. Code like `std.debug.print("{:0>5}\n", .{-123})` prints `0-123` instead of the expected `-0123` or `-00123`.

- Whether a positive number should have a plus sign.

  It is currently possible to format a positive integer with a plus sign, but how to do this is very obscure: You need to specify a `width` of 1 or greater, and the integer type must be signed. `std.debug.print("{d:1}\n", .{@as(i32, 123)})` prints `+123`. Unsigned integers and floating-point numbers, however, currently cannot be formatted with a plus sign.

- Whether a hexadecimal number should be prefixed with `0x`.

  It is currently not possible to format hexadecimal (or binary or octal) numbers with a prefix. In some situations you may be able to get around this by using a placeholder string like `0x{x}`, but this will not work for negative values (`-77` would produce `0x-4d`) or when left-padding is involved.

There may be other options worth considering, but this should hopefully demonstrate that just `precision` is probably not enough and that `width` is not a suitable substitute for zero-padding.

Which leads us to the next point...
Numeric formatting options are conflated with alignment options

The four `std.fmt.FormatOptions` are `fill`, `alignment`, `width` and `precision`. This is a bit strange, because the first three deal with alignment and are always applicable no matter which type of value is being formatted, while `precision` is only relevant for numbers.

Numeric formatting (positive sign, leading zeroes, etc.) is a different concern from alignment; "zero-pad a number to a minimum number of digits" is different from "right-align a string to a minimum width by left-padding with the character `0`" and there is currently no way for users to simultaneously zero-pad and right-align a number.

It is also a bit funny that a nonsensical `precision` option used with a non-numeric specifier like in `{s:.3}` is not an error.

I think it would make sense to break out `precision` and other future numeric formatting options from the generic alignment options specified after the `:` and instead make them part of the base specifiers (e.g. `d` or `e`) themselves. In other words, today's placeholder `{d: >10.3}` might become `{d.3: >10}`.

This also opens up the door for specifier-specific options; for example, an option specifying whether to prefix the number with `0x` makes sense for `x` (similarly for `b` or `o`) but not for `d` and should be a compile error for the latter.

With `std.fmt.FormatOptions` reduced to only the three alignment-related options, we can move on to the third point...

It is difficult to implement custom `format` formatter functions correctly

Custom `format` formatter functions currently take the value, a comptime `fmt` string, an `options` parameter and a writer.

The `options` parameter of type `std.fmt.FormatOptions` specifies the fill character, alignment, minimum width and numeric precision, corresponding to the options passed after the colon in the placeholder string. `{:_>9.3}` is parsed as `.{ .fill = '_', .alignment = .right, .width = 9, .precision = 3 }`.

The problem is, most custom formatters (both in `std` and in external packages) completely ignore these options.

One could argue that the onus is on the custom formatters to correctly implement padding and that it is a bug that formatters like `fmtSliceHexLower` or `SemanticVersion.format` don't handle padding.

I will instead point out that padding could be trivially handled in the main `std.fmt.format` function, without burdening custom formatters with the task of implementing it, simply by writing in two passes: first to a `std.io.countingWriter(std.io.null_writer)` to determine the width of the unpadded string, then again to the real writer, padding the difference on either side as needed. Left-alignment only requires a single pass to a `std.io.countingWriter(writer)`.

With `fill`, `alignment` and `width` handled generically, the remaining option would be `precision`. But with that one removed by the above sub-proposal, we are left with no options and can remove the `std.fmt.FormatOptions` parameter, simplifying the `format` signature to just the value, the `fmt` string and the writer, which makes it much easier for users to implement correctly.
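The two-pass padding idea described above can be sketched roughly as follows. This is an illustration only, assuming a Zig 0.12/0.13-era standard library where `std.io.countingWriter` and `std.io.null_writer` exist; `printRightAligned` is a hypothetical name and not how `std.fmt.format` is actually structured.

```zig
const std = @import("std");

/// Hypothetical helper: formats `args` with `fmt`, right-aligned to at
/// least `width` bytes by left-padding with the byte `fill`.
fn printRightAligned(
    writer: anytype,
    width: usize,
    fill: u8,
    comptime fmt: []const u8,
    args: anytype,
) !void {
    // Pass 1: format into a counting writer that discards its output,
    // only to learn how many bytes the unpadded text needs.
    var counter = std.io.countingWriter(std.io.null_writer);
    try counter.writer().print(fmt, args);
    const len: usize = @intCast(counter.bytes_written);

    // Write the padding, then pass 2: format again to the real writer.
    if (len < width) try writer.writeByteNTimes(fill, width - len);
    try writer.print(fmt, args);
}

pub fn main() !void {
    const stdout = std.io.getStdOut().writer();
    try printRightAligned(stdout, 8, '.', "{s}={d}", .{ "x", 42 }); // prints "....x=42"
    try stdout.writeByte('\n');
}
```

Because the padding is computed from the byte count alone, whatever formatter produces the text never has to look at `fill`, `alignment` or `width` itself. The cost is that the value is formatted twice for right or center alignment.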
(As a side note, the `fmt` argument here should really be renamed `specifier` or `spec` so that it doesn't get mixed up with the `fmt` string itself.)

Remove named placeholder options
Did you know that certain placeholder options like `width` and `precision` don't have to be specified literally but can also be resolved at runtime by specifying the name of a field of `args`?

This is a fairly obscure feature which increases the overall complexity of `std.fmt`. It is also limited to only `width` and `precision`; other options like `fill` or `alignment` must be specified literally and cannot be resolved at runtime.

Instead of putting all of this complexity in the parsing and handling of the placeholder string itself, runtime control of formatting options is probably better handled by custom formatters, which are not only more flexible but also make the intent of such code more immediately visible and explicit to readers. To help users with the task of runtime-controlled aligned formatting, the `std.fmt` namespace could expose a formatter function for this purpose.

Remove any notion of Unicode-awareness from `std.fmt`

(Related: #18536 (comment), 2d9c479, #234)

Simple: `std.fmt` should not be Unicode-aware and should deal in raw bytes only, for simplicity. Therefore,

- the `u` "format `u21` as UTF-8 sequence" specifier should be removed (better handled by a formatter from `std.unicode`),
- the `s` and `c` specifiers should clarify that they output (sequences of) bytes verbatim, without any sort of replacement or transformation,
- the `width` placeholder option should clarify that it controls the minimum width in bytes (not code points, grapheme clusters or some other unit of measure), and
- the `fill` placeholder option should clarify that it is a literal byte repeated verbatim to pad out the string.

Applications that need powerful Unicode-aware formatting should use a different third-party package.
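To illustrate why the unit of `width` matters, the small sketch below compares the byte length and the code point count of one of the strings discussed in this thread. It assumes a Zig 0.12/0.13-era standard library and only uses `std.unicode.utf8CountCodepoints`; it does not attempt grapheme segmentation, which `std` does not provide.

```zig
const std = @import("std");

pub fn main() !void {
    // 'n' (U+006E) followed by U+0303 COMBINING TILDE: displayed as one
    // character, but encoded as two code points and three UTF-8 bytes.
    const s = "\u{006E}\u{0303}";

    const byte_len = s.len;
    const codepoints = try std.unicode.utf8CountCodepoints(s);

    // prints "bytes=3 codepoints=2"
    std.debug.print("bytes={d} codepoints={d}\n", .{ byte_len, codepoints });
}
```

Under the proposal, a `width` of 10 would pad this string with 7 fill bytes, regardless of how many columns it occupies on screen.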
Other considerations

`std.fmt` currently generates a lot of code, which is undesirable and can be problematic for constrained embedded targets. These problems are described in great detail in #9635. It's important that the above suggestions, if applied, do not negatively affect code size, compile times or runtime performance.