- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
make codepoint(c) work for overlong chars #55152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
These changes seems to be at odds with other docs, because now Lines 4 to 10 in 197295c
Also, the information that a Lines 40 to 44 in 197295c
|
The basic problem is that this is an oversimplification. The I updated the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me
Unrelated CI failure, updating and re-running CI. |
The docstring of |
Triage likes but would also like for "malformed" to be documented somewhere and to adjust the docstring of codepoint to refer to malformed rather than invalid. @LilithHafner feels that it would be good to block the publicness of
An additional comment regarding equality and comparison:
This allows each string type to define a total ordering on valid and invalid strings in a way that's efficient and consistent within the type, but comparisons of invalid strings across types can simply error since there's no sensible way to implement that and forcing it to be consistent would force valid comparisons to be done inefficiently. |
It seems like the triage requests were all already addressed.
Removing the "merge me" label, however, until it is clear that everyone is satisfied. |
Windows build failure looks unrelated: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like a more complete/specific/accessible definition of malformed vs invalid, or a link to the specific part of the unicode standard that defines it; but I don't think that is blocking given the level of docs already in this PR.
I didn't find a definition for either word, but did find definitions for their synonyms/antonyms..
xref:
|
Assuming UTF-8, Unicode specifies that any code unit sequence not listed in this table is ill-formed and not well-formed. IIUC, our definition of malformed is different from Unicode's definition of ill-formed. For example, overlong characters are not |
There's no official definition of "malformed". However, IMO this is a missing concept from the Unicode spec. The spec is generally a bit vague and self-contradictory on handling invalid data. There's a difference between a character that is sensibly encoded but not allowed—e.g. because it is a surrogate code point or too high or an overlong encoding—and data that simply doesn't follow the expected structure of an encoding. This distinction only applies to UTF-8 because any sequence of code units in UTF-16 is well-formed—the only encoding error you can have is unpaired surrogates, which is well-formed but invalid. (The only way you can have truly malformed UTF-16 is if you have an odd number of bytes so your last code unit is incomplete.) Why does this distinction matter? There are a few reasons:
Our definition of "well-formed UTF-8-like data" is as sequence of bytes of the following forms:
Any such sequence can be mapped to a code point value. |
(Fixed merge conflicts.) |
As discussed in #54393,
codepoint(c)
should succeed for overlong encodings, and wheneverismalformed(c)
returnsfalse
. This should be backwards compatible since it simply removes an error, and should be strictly faster than before since it merely removes a call toBase.is_overlong_enc
.Also,
Base.ismalformed
andBase.isoverlong
are declaredpublic
(but not yet exported) and are included in the manual, since they are referenced in the docstring ofcodepoint
etcetera. I also madeBase.show_invalid
a
public
and documented function, since it is referenced from theismalformed
docs and is required by new implementations ofAbstractChar
types that support malformed data.Fixes #54343, closes #54393.