make codepoint(c) work for overlong chars #55152

stevengj · 2024-07-17T15:20:38Z

As discussed in #54393, codepoint(c) should succeed for overlong encodings, and whenever ismalformed(c) returns false. This should be backwards compatible since it simply removes an error, and should be strictly faster than before since it merely removes a call to Base.is_overlong_enc.

Also, Base.ismalformed and Base.isoverlong are declared public (but not yet exported) and are included in the manual, since they are referenced in the docstring of codepoint, among others. I also made Base.show_invalid a public and documented function, since it is referenced from the ismalformed docs and is required by new implementations of AbstractChar types that support malformed data.

Fixes #54343, closes #54393.

nhz2 · 2024-07-28T16:20:54Z

These changes seem to be at odds with other docs, because now codepoint(a) == codepoint(b) no longer implies a == b.

julia/base/char.jl

Lines 4 to 10 in 197295c

    The `AbstractChar` type is the supertype of all character implementations
    in Julia. A character represents a Unicode code point, and can be converted
    to an integer via the [`codepoint`](@ref) function in order to obtain the
    numerical value of the code point, or constructed from the same integer.
    These numerical values determine how characters are compared with `<` and `==`,
    for example.  New `T <: AbstractChar` types should define a `codepoint(::T)`
    method and a `T(::UInt32)` constructor, at minimum.

Also, the information that a Char is overlong will be silently destroyed by conversion to UInt32 instead of throwing an error.

julia/base/char.jl

Lines 40 to 44 in 197295c

    In order to losslessly represent arbitrary byte streams stored in a `String`,
    a `Char` value may store information that cannot be converted to a Unicode
    codepoint — converting such a `Char` to `UInt32` will throw an error.
    The [`isvalid(c::Char)`](@ref) function can be used to query whether `c`
    represents a valid Unicode character.

stevengj · 2024-07-28T19:54:05Z

> A character represents a Unicode code point

The basic problem is that this is an oversimplification. The Char type represents the encoding, not just the codepoint, and can represent byte sequences that don’t encode Unicode code points.

I updated the AbstractChar docs to be more accurate.

StefanKarpinski

Looks great to me

stevengj · 2025-01-02T18:55:14Z

Unrelated CI failure, updating and re-running CI.

base/char.jl

LilithHafner · 2025-01-02T19:26:49Z

The docstring of codepoint currently reads "...throw an exception if c does not represent a valid character...". That should be changed to "...represent a malformed character..." and link to the definition of invalid but not malformed.

StefanKarpinski · 2025-01-02T19:32:50Z

Triage likes this, but would also like "malformed" to be documented somewhere and the docstring of codepoint adjusted to refer to malformed rather than invalid. @LilithHafner feels that it would be good to block the publicness of ismalformed on documentation of what it means, so maybe this is a good ordering:

1. Add docstring for ismalformed defining what it does
2. Make ismalformed public
3. Update codepoint docstring to refer to malformed versus invalid

An additional comment regarding equality and comparison:

- Valid strings are compared as lexicographically ordered sequences of code points
- A valid string and an invalid string must never be equal
- Comparison of invalid strings is implementation-defined and may error but should be an ordering:
  - Reflexive: s == s for all strings
  - Antisymmetric: s <= t and t <= s implies s == t
  - Transitive: s <= t and t <= u implies s <= u
  - Total: either s <= t or t <= s or both are an error

This lets each string type define a total ordering on valid and invalid strings that is efficient and consistent within the type. Comparisons of invalid strings across types can simply error: there is no sensible way to implement them, and forcing them to be consistent would force valid comparisons to be done inefficiently.
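As a rough illustration of the ordering rules above, here is a minimal Python sketch (Python is used only for illustration; this is not Julia's Base code). It assumes one possible strategy for a UTF-8-backed string type: compare the raw bytes. The helper name `satisfies_order_laws` and the sample data are hypothetical.

```python
from itertools import product

def satisfies_order_laws(items) -> bool:
    """Check the four listed properties over every triple of sample items."""
    for s, t, u in product(items, repeat=3):
        if not s == s:                            # reflexive
            return False
        if s <= t and t <= s and s != t:          # antisymmetric
            return False
        if s <= t and t <= u and not s <= u:      # transitive
            return False
        if not (s <= t or t <= s):                # total
            return False
    return True

# Valid strings: ordering by UTF-8 bytes agrees with ordering by code points.
valid = ["z", "é", "a", "€", "", "ab", "\U0001F600", "b"]
by_code_point = sorted(valid)  # Python compares str values by code point
by_bytes = sorted(valid, key=lambda s: s.encode("utf-8"))

# Raw buffers, some of which are not valid UTF-8 (overlong, lone continuation, 0xFF).
buffers = [b"a", b"ab", b"\xc3\xa9", b"\xc0\x80", b"\x80", b"\xff"]

print(by_bytes == by_code_point)      # True
print(satisfies_order_laws(buffers))  # True
```

Under this assumption, a single byte-wise comparison orders valid strings by code point and still gives invalid data a consistent place in the order, which is the kind of per-type efficiency the comment describes.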

stevengj · 2025-01-02T19:40:56Z

It seems like the triage requests were all already addressed.

- ismalformed already has a docstring in this PR (and is included in the manual)
- ismalformed is already public (but not exported) in this PR
- the codepoint docs already refer to malformed rather than valid in this PR

Removing the "merge me" label, however, until it is clear that everyone is satisfied.

base/char.jl

stevengj · 2025-01-02T21:57:12Z

Windows build failure looks unrelated: ERROR: Unable to open agent private key path 'C:\secrets/agent.key'! Make sure your agent has this file deployed within it!

LilithHafner

I'd like a more complete/specific/accessible definition of malformed vs invalid, or a link to the specific part of the unicode standard that defines it; but I don't think that is blocking given the level of docs already in this PR.

inkydragon · 2025-01-03T03:31:45Z

> malformed vs invalid

I didn't find a definition for either word, but did find definitions for their synonyms/antonyms.
Glossary of Unicode Terms

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.

Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.

UTF-8 has some strong constraints on the possible byte ranges for leading and trailing bytes. A violation of those constraints would produce a code unit sequence that could not be mapped to a Unicode scalar value, resulting in an ill-formed code unit sequence.

xref:

D89 In a Unicode encoding form: ...

A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8 string for short.

LilithHafner · 2025-01-03T15:34:07Z

Assuming UTF-8,

Unicode specifies that any code unit sequence not listed in this table is ill-formed and not well-formed.

IIUC, our definition of malformed is different from Unicode's definition of ill-formed. For example, overlong characters are not Base.ismalformed but are ill-formed according to Unicode.
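To make the difference concrete, here is a small Python sketch (Python is used only for illustration; it does not call Julia). It relies on the fact that Python's built-in strict UTF-8 decoder rejects every ill-formed sequence in the Unicode sense, so overlong, surrogate, out-of-range, and structurally broken inputs are all refused alike. The helper name `strict_utf8_accepts` is hypothetical.

```python
def strict_utf8_accepts(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "valid U+00E9": b"\xc3\xa9",
    "overlong U+0000": b"\xc0\x80",
    "surrogate U+D800": b"\xed\xa0\x80",
    "above U+10FFFF": b"\xf4\x90\x80\x80",
    "lone continuation": b"\x80",
}

for label, data in samples.items():
    print(label, strict_utf8_accepts(data))
```

Only the first sample is accepted. A strict decoder does not separate the overlong, surrogate, and too-high cases from the lone continuation byte, whereas the malformed notion discussed in this thread treats only the last one as malformed.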

StefanKarpinski · 2025-03-25T15:34:37Z

There's no official definition of "malformed". However, IMO this is a missing concept from the Unicode spec. The spec is generally a bit vague and self-contradictory on handling invalid data. There's a difference between a character that is sensibly encoded but not allowed—e.g. because it is a surrogate code point or too high or an overlong encoding—and data that simply doesn't follow the expected structure of an encoding. This distinction only applies to UTF-8 because any sequence of code units in UTF-16 is well-formed—the only encoding error you can have is unpaired surrogates, which is well-formed but invalid. (The only way you can have truly malformed UTF-16 is if you have an odd number of bytes so your last code unit is incomplete.) Why does this distinction matter? There are a few reasons:

- If you're trying to decode an invalid string and you encounter a well-formed but invalid character, you should treat it as a single invalid character, not multiple invalid code units; if you encounter malformed data, you should treat it as multiple invalid code units in the way that the Unicode spec recommends for producing replacement characters.
- It's coherent to ask some kinds of questions about well-formed but invalid characters. For example, you can ask what code point a well-formed character encodes. It may be an illegal one (surrogate, too high), or it may be an illegal encoding of a valid code point (overlong), but there's a single sensible answer to the question and this answer is sometimes useful. For example, if the data is WTF-8 or Modified UTF-8 or CESU-8, these are all slight variations on UTF-8 that are mostly valid but deviate from the standard in minor ways. If the data is malformed, on the other hand, then there simply is no coherent answer to "what code point does this encode".

Our definition of "well-formed UTF-8-like data" is a sequence of bytes in one of the following forms:

- b1 where leading_ones(b1) == 0
- b1, b2 where leading_ones(b1) == 2 && leading_ones(b2) == 1
- b1, b2, b3 where leading_ones(b1) == 3 && leading_ones(b2) == leading_ones(b3) == 1
- b1, b2, b3, b4 where leading_ones(b1) == 4 && leading_ones(b2) == leading_ones(b3) == leading_ones(b4) == 1

Any such sequence can be mapped to a code point value.
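The four forms above can be sketched in Python as follows (an illustrative sketch only, not the code in Julia's base/char.jl). The function names `leading_ones`, `is_well_formed`, and `code_point` are hypothetical stand-ins chosen to mirror the wording of the comment.

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits at the top of an 8-bit value before the first 0 bit."""
    count = 0
    for shift in range(7, -1, -1):
        if (byte >> shift) & 1:
            count += 1
        else:
            break
    return count

def is_well_formed(seq: bytes) -> bool:
    """True if seq matches one of the four structural forms listed above."""
    n = len(seq)
    if n == 1:
        return leading_ones(seq[0]) == 0
    if 2 <= n <= 4:
        return (leading_ones(seq[0]) == n
                and all(leading_ones(b) == 1 for b in seq[1:]))
    return False

def code_point(seq: bytes) -> int:
    """Map a well-formed sequence to a code point value, with no validity checks."""
    if not is_well_formed(seq):
        raise ValueError("malformed sequence has no code point")
    if len(seq) == 1:
        return seq[0]
    # Payload bits of the lead byte: everything below the length marker.
    value = seq[0] & (0xFF >> (len(seq) + 1))
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)  # six payload bits per continuation byte
    return value

print(hex(code_point(b"\xc3\xa9")))      # 0xe9   (valid)
print(hex(code_point(b"\xc0\x80")))      # 0x0    (overlong encoding of U+0000)
print(hex(code_point(b"\xed\xa0\x80")))  # 0xd800 (surrogate)
print(is_well_formed(b"\x80"))           # False  (lone continuation byte)
```

The sketch deliberately checks structure only: overlong, surrogate, and too-high sequences all decode to a number, while structurally broken input raises an error, matching the distinction drawn in the comment.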

stevengj · 2025-03-26T12:19:33Z

(Fixed merge conflicts.)

NEWS.md

make codepoint(c) work for overlong chars · 0b6cf37

stevengj added the unicode label Jul 17, 2024

stevengj assigned StefanKarpinski Jul 17, 2024

add PR # to NEWS · 28f59dd

stevengj added 4 commits July 30, 2024 19:10

clarify AbstractChar docs · 1cb0291

Merge branch 'master' into codepoint_overlong · 6df8031

Update char.jl: rm trailing whitespace · bb40574

Merge branch 'master' into codepoint_overlong · 596469b

stevengj added the triage label Jan 1, 2025

stevengj mentioned this pull request Jan 1, 2025: add hascodepoint(c::AbstractChar) and use it #54393 (Open)

Merge branch 'master' into codepoint_overlong · 7a15f00

StefanKarpinski approved these changes Jan 2, 2025

Merge branch 'master' into codepoint_overlong · a60f0c6

stevengj added merge me and removed triage labels Jan 2, 2025

LilithHafner reviewed Jan 2, 2025 (base/char.jl, outdated)

move public decls to public.jl · f778fab

stevengj removed the merge me label Jan 2, 2025

LilithHafner reviewed Jan 2, 2025 (base/char.jl: three threads, two outdated)

stevengj added 2 commits January 2, 2025 15:46

Update base/char.jl · b8a06bd

not well-formed -> malformed · c717418

LilithHafner approved these changes Jan 2, 2025

stevengj added the merge me label Jan 3, 2025

LilithHafner removed the merge me label Jan 3, 2025

Merge branch 'master' into codepoint_overlong · a368ab6

oscardssmith reviewed Mar 26, 2025 (NEWS.md)
