[Breaking change request] Change UTF-8 encoder and decoder to match the WHATWG encoding standard #41100
Comments
Please send the announcement. The change will need to wait until the beta branch has been snapped before it can be submitted.
The breaking change has been announced on dart-announce, see announcement.
The beta branch has been cut, so this is not blocked anymore. We would like to move forward with this. @franklinyow Can you help us get the LGTM from all veto powers?
cc @Hixie @matanlurey @dgrove @vsmenon for review and approval.
Is there an impact on prod code size for web or native? (@rakudrama)
@rakudrama and I just discussed the web situation. There is currently a 4k size increase and a significant speed regression for small inputs, but we have ideas for fixing both issues. For native there will probably be a small (a few k) constant size increase.
One detail that we need to decide on: for JS, when Using
@lrhn @rakudrama @vsmenon WDYT?
Use I don't think the exact behavior on malformed inputs is that important. We don't otherwise try to work around browser bugs.
Using @askeksa-google, is the following statement true: All cases of non-standard behaviour have an output containing the replacement character. We do work around some other browser bugs - for example, fixing the unusual exponential form in IE11's Number.toString(radix), and various differences in the DOM APIs.
That statement is true. A stricter statement which is also true is: All cases of non-standard behaviour have an output containing the replacement character as the last character or two replacement characters in a row. The "when an unfinished 4-byte sequence is interrupted after exactly 2 bytes" deviation in Chromium is unfortunate, since without that (assuming only the other observed deviations) we would only have to check the last character of the output. For mostly-ASCII inputs, it might very well be faster to always fall back than to scan the result. At least the cross-over point will be quite high.
I feel fine. File bug reports against the browsers and hope they get around to fixing it. It's still only error handling, it's not like they do something different for a well-formed input.
Having standards-compliant error handling can be as important as handling valid input. For one, variance in error handling is often a vector in vulnerabilities. For another, most people don't care if input is valid or not. They just care that all their software interoperates.
This adjusts all UTF-8 tests to the new semantics in the breaking change described here: #41100

This has three parts:
- Unpaired surrogates are encoded as replacement characters, and encoded surrogates are considered malformed input when decoding.
- Decoding errors are generally reported on the position of the byte that conclusively makes the input malformed.
- The number of replacement characters emitted by the decoder is generally one per unfinished sequence or undecodable byte.

The code changes to implement the new semantics are placed in subsequent commits.

Change-Id: I4cc8ce660e39287e734070764ab8e1f0ebb8b9e0
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/143815
Reviewed-by: Lasse R.H. Nielsen <[email protected]>
This implements the encoding part of the breaking change described at #41100 Change-Id: I22f2ffc24efc783a2199f640690a85c70a85e7d2 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/143818 Reviewed-by: Lasse R.H. Nielsen <[email protected]>
Two-pass decoder: the first pass scans through the input to compute the length of the resulting string and which decoder to use, and the second pass does the actual decoding.

The same decoder is used for both one-shot and chunked decoding, and both with and without allowMalformed. If there is an error in the input and allowMalformed is true, it starts over with a general decoder that supports malformed input and allocates space as it goes along. JS targets go directly to the general decoder, as the two-pass approach is not beneficial here.

Three pieces of the decoder are designed to be pluggable by patches to optimize the performance further:
- scan, running the first pass of the conversion.
- decode8, decoding Latin1 data into a OneByteString.
- decode16, decoding arbitrary data into a TwoByteString.

Improves decoding speed, especially for complex input (many multi-byte characters). Observed speed increases are approximately:
- dart2js: up to 40%
- VM JIT: up to 260%
- VM AOT: up to 130%

The constant overhead of calling the UTF-8 decoder is also significantly reduced for dart2js. Code size for dart2js is slightly reduced compared to the old decoder.

ASCII inputs currently see a slight speed decrease for VM targets, which will be fixed in https://dart-review.googlesource.com/c/sdk/+/145460

This is part of the implementation of the breaking change described at #41100

Closes #28832
Closes #31954

Ideas for further improvements to the decoder are collected in #41734

Change-Id: I3c5bb84e8d6783231680a9d34d6c38e8a28ab112
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/142025
Reviewed-by: Stephen Adams <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
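The two-pass idea in the commit message above can be illustrated with a heavily simplified sketch. This is not the SDK's decoder: it assumes the input is already known to be valid UTF-8, does no error handling, and the names ScanResult, scanValidUtf8 and decodeTwoPass are hypothetical. It only shows how a first pass can compute the output length and whether a one-byte (Latin1) buffer suffices, so that the second pass can write into a preallocated buffer.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Result of the (hypothetical) first pass.
class ScanResult {
  final int utf16Length; // UTF-16 code units needed for the output
  final bool isLatin1; // true if every code point is <= U+00FF
  ScanResult(this.utf16Length, this.isLatin1);
}

/// First pass. Assumes [bytes] is valid UTF-8 (an assumption of this sketch).
ScanResult scanValidUtf8(List<int> bytes) {
  var length = 0;
  var isLatin1 = true;
  for (final b in bytes) {
    if ((b & 0xC0) == 0x80) continue; // continuation byte: no new code unit
    length += (b >= 0xF0) ? 2 : 1; // 4-byte sequences need a surrogate pair
    if (b > 0xC3) isLatin1 = false; // lead bytes above 0xC3 encode > U+00FF
  }
  return ScanResult(length, isLatin1);
}

/// Second pass: decode into a buffer sized by the first pass.
String decodeTwoPass(List<int> bytes) {
  final info = scanValidUtf8(bytes);
  final List<int> units = info.isLatin1
      ? Uint8List(info.utf16Length)
      : Uint16List(info.utf16Length);
  var i = 0;
  var o = 0;
  while (i < bytes.length) {
    final b = bytes[i++];
    int cp;
    if (b < 0x80) {
      cp = b;
    } else if (b < 0xE0) {
      cp = ((b & 0x1F) << 6) | (bytes[i++] & 0x3F);
    } else if (b < 0xF0) {
      cp = ((b & 0x0F) << 12) |
          ((bytes[i++] & 0x3F) << 6) |
          (bytes[i++] & 0x3F);
    } else {
      cp = ((b & 0x07) << 18) |
          ((bytes[i++] & 0x3F) << 12) |
          ((bytes[i++] & 0x3F) << 6) |
          (bytes[i++] & 0x3F);
    }
    if (cp > 0xFFFF) {
      // Split a supplementary code point into a surrogate pair.
      cp -= 0x10000;
      units[o++] = 0xD800 | (cp >> 10);
      units[o++] = 0xDC00 | (cp & 0x3FF);
    } else {
      units[o++] = cp;
    }
  }
  return String.fromCharCodes(units);
}

void main() {
  print(decodeTwoPass(utf8.encode('héllo')));
  print(scanValidUtf8(utf8.encode('😀')).utf16Length);
}
```

Knowing the length and character width up front avoids growing or copying the output buffer, which is presumably where much of the reported VM speedup comes from; the commit message itself notes that this approach is not beneficial on JS targets.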
This brings JSON encoding and decoding in line with the UTF-8 changes described at #41100 The fused UTF-8 / JSON decoder for the VM now uses the new UTF-8 decoder instead of its own, separate UTF-8 decoder. The JSON encoder now escapes lone surrogates, so it can encode JSON string values containing lone surrogates while keeping its output valid UTF-8. Change-Id: Ie4d4601cf84012068849e64d4670f2dcd49ea088 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/144286 Reviewed-by: Lasse R.H. Nielsen <[email protected]>
Having consistent behavior is a good thing, but there is a limit to how much work should be spent on mitigating browser bugs. I don't think the impact here is big enough to warrant adding detection and workarounds for something which is, arguably, already an error situation.
Update package:ffi to a version which does not depend on unpaired surrogates. Breaking change in Dart: #41100. Change-Id: I2a5ba0abee7c6cccb166c234f8f620dbe0063d47 Cq-Include-Trybots: luci.dart.try:vm-ffi-android-debug-arm-try,vm-ffi-android-debug-arm64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/146340 Reviewed-by: Aske Simon Christensen <[email protected]> Commit-Queue: Daco Harkes <[email protected]>
I benchmarked a post-scan for replacement characters. Here it adds about 6ns per byte on top of the
With the changes in #41100 the handling of UTF-8 encoded surrogates in Dart now matches that of JS. Thus, the pre-pass that scans for the presence of surrogates before handing the data to TextDecoder is no longer needed. Removing this gives a significant speedup. On my laptop, in Chrome, on the Utf8Decode benchmark, it gives around 1ns per input byte out of previously roughly 2.5ns (ASCII) to 5ns (Russian).

In principle, this also enables TextDecoder for allowMalformed: true, since the number of replacement characters produced by Dart now matches the WHATWG standard. This does result in failures in some browsers, where these do not adhere to the standard. For instance, Chrome outputs one replacement character per undecoded input byte when an unfinished sequence is interrupted by end-of-input, where the standard specifies only one replacement character.

To work around the browser deviations, the output from TextDecoder is scanned for replacement characters, and if any are found, the decoding falls back to the Dart implementation. This workaround can be removed if the bugs are fixed in the browsers.

Since TextDecoder has a large startup overhead, we also fall back to the Dart implementation for short strings.

Change-Id: I9e95a95ce726ce0d9e9a3b46df8ee2512ab05f0a
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/144294
Commit-Queue: Aske Simon Christensen <[email protected]>
Reviewed-by: Stephen Adams <[email protected]>
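The fallback strategy described in this commit message can be sketched roughly as follows. This is an illustration, not the SDK code: the FastDecoder typedef stands in for a call to the platform's TextDecoder, and shortInputThreshold is an assumed value chosen only for the example.

```dart
import 'dart:convert';

/// Hypothetical stand-in for a fast platform decoder such as TextDecoder.
typedef FastDecoder = String Function(List<int> bytes);

/// Assumed cut-off below which the fast path is not worth its startup cost.
const int shortInputThreshold = 15;

String decodeWithFallback(List<int> bytes, FastDecoder fastDecode) {
  // Short inputs go straight to the Dart implementation.
  if (bytes.length < shortInputThreshold) {
    return utf8.decode(bytes, allowMalformed: true);
  }
  final result = fastDecode(bytes);
  // A replacement character may mark a spot where the platform decoder
  // deviates from the standard, so redo the work with the Dart decoder.
  if (result.contains('\uFFFD')) {
    return utf8.decode(bytes, allowMalformed: true);
  }
  return result;
}

void main() {
  // A fake "platform" decoder that emits too many replacement characters.
  String fake(List<int> bytes) => 'wrong\uFFFD\uFFFD\uFFFD';
  final bytes = [...utf8.encode('hello world, hello'), 0xF0, 0x90, 0x80];
  print(decodeWithFallback(bytes, fake));
}
```

A well-formed input that legitimately contains U+FFFD also takes the fallback path in this sketch; that costs time but does not change the result.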
The breaking change #41100 changed the UTF-8 encoder to encode unpaired surrogates as replacement characters. However, the VM contains its own, internal UTF-8 encoder, which is used for printing and for the Dart_StringToUTF8 function in the Dart API. Here, this encoder is changed to also encode unpaired surrogates as replacement characters. Fixes #42094 Change-Id: I9d55168f67d124dbc7987fb759696a98e7526c29 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/149292 Commit-Queue: Aske Simon Christensen <[email protected]> Reviewed-by: Martin Kustermann <[email protected]> Reviewed-by: Daco Harkes <[email protected]>
Summary
Change encoding and decoding of UTF-8 to conform to the WHATWG encoding standard. This means that it will never emit invalid UTF-8, only accept valid UTF-8 and be compatible with the TextEncoder and TextDecoder classes in JavaScript.

Related issues: #7046, #22330, #28832, #31370, #31954
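As a rough illustration of what this proposal means for code using dart:convert, the sketch below exercises the encoder and decoder on an SDK that includes the change. The helper names strictlyRejects and chunkedEncode are made up for this example, and the byte values in the comments follow from the behaviour described in this issue rather than from a measured run.

```dart
import 'dart:convert';

/// Returns true if strict decoding rejects [bytes] as malformed.
bool strictlyRejects(List<int> bytes) {
  try {
    utf8.decode(bytes);
    return false;
  } on FormatException catch (e) {
    print('rejected at offset ${e.offset}: ${e.message}');
    return true;
  }
}

/// Encodes [chunks] through the chunked conversion API and collects the bytes.
List<int> chunkedEncode(List<String> chunks) {
  final out = <int>[];
  final sink = utf8.encoder.startChunkedConversion(
      ByteConversionSink.withCallback((bytes) => out.addAll(bytes)));
  for (final chunk in chunks) {
    sink.add(chunk);
  }
  sink.close();
  return out;
}

void main() {
  // An unpaired surrogate is encoded as U+FFFD (0xEF 0xBF 0xBD).
  print(utf8.encode('a\uD800b')); // [97, 239, 191, 189, 98]

  // An encoded surrogate (0xED followed by 0xA0..0xBF) is malformed input.
  print(strictlyRejects([0xED, 0xA0, 0x80]));

  // A surrogate pair split across two chunks is still combined.
  print(chunkedEncode(['\uD83D', '\uDE00'])); // [240, 159, 152, 128]

  // With allowMalformed, an unfinished sequence yields replacement output.
  final lenient = utf8.decode([0x41, 0xF0, 0x90, 0x80], allowMalformed: true);
  print(lenient.runes.map((r) => r.toRadixString(16)).toList());
}
```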
What is changing:
- When decoding with the Utf8Codec or Utf8Decoder class, the input is considered malformed if it contains an encoded surrogate character (code point in the range U+D800-U+DFFF, encoded in UTF-8 as a 3-byte character encoding where the first byte is 0xED and the second byte is in the range 0xA0-0xBF).
- When encoding with the Utf8Codec or Utf8Encoder class, and the string contains an unpaired surrogate, that surrogate is emitted as a replacement character (U+FFFD, encoded in UTF-8 as 0xEF, 0xBF, 0xBD) instead of an encoded surrogate (which is invalid UTF-8). For chunked conversion, if a chunk ends with a high surrogate and the next chunk starts with a low surrogate, these surrogates are considered properly paired and are combined, like before.
- When decoding with allowMalformed set to true, the number of replacement characters emitted will sometimes differ from the number currently emitted. Specifically, the decoder will emit one replacement character for each maximal sequence of input bytes that is either an unfinished sequence or a single undecodable byte.
- When decoding with allowMalformed set to false, the offset in the resulting FormatException will point to the first byte from which the decoder can conclude that the sequence is malformed, rather than the first byte that was not decoded successfully. Also, the message of the FormatException will sometimes be different from what it is currently. If the input contains more than one error, the FormatException may point to a different error than before.

Why is this changing?
Dart strings (like JS and Java strings) may contain unpaired surrogates. The current strategy of allowing surrogates when encoding and decoding UTF-8 ensures that any Dart string can be encoded as UTF-8 (actually, WTF-8) and decoded back into the original string.
This strategy has a number of drawbacks:
TextEncoder and TextDecoder classes. It must do some or all of the conversion in Dart code, which has a significant performance cost.

The purpose of the change is thus to:
Expected impact
Programs manipulating strings through usual string operations are unlikely to be affected.
A program may be affected by this change if it does any of the following:
Mitigation
For the scenarios listed above:
Variations
An optional allowSurrogates parameter could be added to the encoder and decoder to support the round-trip use case. To obtain the performance benefits, it should default to false. This could introduce further breakage for programs implementing the Utf8Codec interface (unless we only put the flag on the constructors).

If the surrogate change is considered too risky, the error and replacement character changes on their own can still ease the VM optimizations and possibly improve the performance of JS when allowMalformed is set to true.