Closed
Description
The Unicode specification defines aliases for some of the general category names. For example, the category "L" has the alias "Letter".
The regexp package supports \p{L} but not \p{Letter}, because nothing in the Unicode tables tells regexp about Letter.
To support \p{Letter}, I propose adding a small new table to the unicode package:
var CategoryAliases = map[string]string{
	"Other": "C",
	"Control": "Cc",
	...,
	"Letter": "L",
	...
}
This would be auto-generated from the Unicode database like all our other tables. For Unicode 15, the table would have only 38 entries, listed below.
% grep '^gc' PropertyValueAliases.txt
gc ; C ; Other # Cc | Cf | Cn | Co | Cs
gc ; Cc ; Control ; cntrl
gc ; Cf ; Format
gc ; Cn ; Unassigned
gc ; Co ; Private_Use
gc ; Cs ; Surrogate
gc ; L ; Letter # Ll | Lm | Lo | Lt | Lu
gc ; LC ; Cased_Letter # Ll | Lt | Lu
gc ; Ll ; Lowercase_Letter
gc ; Lm ; Modifier_Letter
gc ; Lo ; Other_Letter
gc ; Lt ; Titlecase_Letter
gc ; Lu ; Uppercase_Letter
gc ; M ; Mark ; Combining_Mark # Mc | Me | Mn
gc ; Mc ; Spacing_Mark
gc ; Me ; Enclosing_Mark
gc ; Mn ; Nonspacing_Mark
gc ; N ; Number # Nd | Nl | No
gc ; Nd ; Decimal_Number ; digit
gc ; Nl ; Letter_Number
gc ; No ; Other_Number
gc ; P ; Punctuation ; punct # Pc | Pd | Pe | Pf | Pi | Po | Ps
gc ; Pc ; Connector_Punctuation
gc ; Pd ; Dash_Punctuation
gc ; Pe ; Close_Punctuation
gc ; Pf ; Final_Punctuation
gc ; Pi ; Initial_Punctuation
gc ; Po ; Other_Punctuation
gc ; Ps ; Open_Punctuation
gc ; S ; Symbol # Sc | Sk | Sm | So
gc ; Sc ; Currency_Symbol
gc ; Sk ; Modifier_Symbol
gc ; Sm ; Math_Symbol
gc ; So ; Other_Symbol
gc ; Z ; Separator # Zl | Zp | Zs
gc ; Zl ; Line_Separator
gc ; Zp ; Paragraph_Separator
gc ; Zs ; Space_Separator
%
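As a rough illustration of how such a table could be derived from the gc lines above, here is a minimal sketch. It is not the actual generator: parseCategoryAliases and sampleAliases are hypothetical names, and the sketch records every alias on a line (including secondary ones such as cntrl), which may or may not match what the real table does.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleAliases is a short excerpt of the gc lines from PropertyValueAliases.txt.
const sampleAliases = `gc ; C ; Other # Cc | Cf | Cn | Co | Cs
gc ; Cc ; Control ; cntrl
gc ; L ; Letter # Ll | Lm | Lo | Lt | Lu
gc ; M ; Mark ; Combining_Mark # Mc | Me | Mn
gc ; Nd ; Decimal_Number ; digit
`

// parseCategoryAliases is a hypothetical helper (not the real generator).
// It maps each alias found on a "gc" line to that line's short category name.
func parseCategoryAliases(data string) map[string]string {
	aliases := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		// Drop the trailing "# ..." comment, if any.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Split(line, ";")
		if len(fields) < 3 || strings.TrimSpace(fields[0]) != "gc" {
			continue // not a general-category alias line
		}
		short := strings.TrimSpace(fields[1])
		// Assumption: every remaining field is an alias for the short name.
		for _, f := range fields[2:] {
			if alias := strings.TrimSpace(f); alias != "" {
				aliases[alias] = short
			}
		}
	}
	return aliases
}

func main() {
	aliases := parseCategoryAliases(sampleAliases)
	fmt.Println(aliases["Letter"], aliases["cntrl"], aliases["Combining_Mark"]) // prints "L Cc M"
}
```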
Status
Accepted
Activity
gabyhelp commented on Dec 11, 2024
Related Issues
rsc commented on Jan 8, 2025
I implemented this, and there are a few additions. The proposal is now:
Title changed from "proposal: unicode: add CategoryAliases" to "proposal: unicode: add CategoryAliases, LC, Cn"
gopherbot commented on Jan 8, 2025
Change https://go.dev/cl/641395 mentions this issue:
internal/export/unicode: add CategoryAliases, Cn, and LC
gopherbot commented on Jan 8, 2025
Change https://go.dev/cl/641376 mentions this issue:
unicode: add CategoryAliases, Cn, LC
gopherbot commented on Jan 8, 2025
Change https://go.dev/cl/641377 mentions this issue:
regexp/syntax: recognize category aliases like \p{Letter}
rsc commented on Feb 5, 2025
This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group
willfaught commented on Feb 7, 2025
Could there be any compatibility issues with new Unicode versions? Dropped or renamed or changed aliases?
Will regexp then use the map?
Edit: The changes to regexp are at #70781.
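To make the question concrete, here is a hedged sketch of how a caller might consult an alias map before looking up a category table. It does not show how regexp does it; categoryAliases, lookupCategory, and inCategory are hypothetical names, and the map is a hand-written stand-in holding only a few entries.

```go
package main

import (
	"fmt"
	"unicode"
)

// categoryAliases is a hand-written stand-in for the proposed table
// (assumption: only a few entries are listed here).
var categoryAliases = map[string]string{
	"Letter":           "L",
	"Uppercase_Letter": "Lu",
	"Decimal_Number":   "Nd",
	"digit":            "Nd",
}

// lookupCategory is a hypothetical helper. It resolves name through the
// alias map, then falls back to treating name as a short category name.
func lookupCategory(name string) (*unicode.RangeTable, bool) {
	if short, ok := categoryAliases[name]; ok {
		name = short
	}
	t, ok := unicode.Categories[name]
	return t, ok
}

// inCategory reports whether r belongs to the named category or alias.
// Unknown names report false.
func inCategory(name string, r rune) bool {
	t, ok := lookupCategory(name)
	return ok && unicode.Is(t, r)
}

func main() {
	for _, name := range []string{"L", "Letter", "digit", "NoSuchName"} {
		if _, ok := lookupCategory(name); !ok {
			fmt.Printf("%s: unknown category\n", name)
			continue
		}
		fmt.Printf("%s: contains 'é' = %v, contains '7' = %v\n",
			name, inCategory(name, 'é'), inCategory(name, '7'))
	}
}
```

Because the alias lookup happens first and the short name is the fallback, existing names such as "L" keep working unchanged.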
rsc commented on Feb 12, 2025
In general, Unicode data is subject to change as Unicode changes. That said, I don't expect aliases to be deleted from the list. (We've seen them change the category of an individual code point in the past, but even that is rare.)
Title changed from "proposal: unicode: add CategoryAliases, LC, Cn" to "unicode: add CategoryAliases, LC, Cn"
internal/export/unicode: add CategoryAliases, Cn, and LC
regexp/syntax: recognize category aliases like \p{Letter}
unicode: add CategoryAliases, Cn, LC
\p{...} and \P{...} format for Go regexp ogen-go/ogen#1475