
unicode: add CategoryAliases, LC, Cn #70780

Closed
@rsc

Description

@rsc
Contributor

The Unicode specification defines aliases for some of the general category names. For example the category "L" has alias "Letter".

The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.

To support \p{Letter}, I propose adding a new, small table to unicode:

var CategoryAliases = map[string]string{
	"Other": "C",
	"Control": "Cc",
	...,
	"Letter": "L",
	...
}

This would be auto-generated from the Unicode database like all our other tables. For Unicode 15, the table would have only 38 entries, listed below.

% grep '^gc' PropertyValueAliases.txt
gc ; C                                ; Other                            # Cc | Cf | Cn | Co | Cs
gc ; Cc                               ; Control                          ; cntrl
gc ; Cf                               ; Format
gc ; Cn                               ; Unassigned
gc ; Co                               ; Private_Use
gc ; Cs                               ; Surrogate
gc ; L                                ; Letter                           # Ll | Lm | Lo | Lt | Lu
gc ; LC                               ; Cased_Letter                     # Ll | Lt | Lu
gc ; Ll                               ; Lowercase_Letter
gc ; Lm                               ; Modifier_Letter
gc ; Lo                               ; Other_Letter
gc ; Lt                               ; Titlecase_Letter
gc ; Lu                               ; Uppercase_Letter
gc ; M                                ; Mark                             ; Combining_Mark                   # Mc | Me | Mn
gc ; Mc                               ; Spacing_Mark
gc ; Me                               ; Enclosing_Mark
gc ; Mn                               ; Nonspacing_Mark
gc ; N                                ; Number                           # Nd | Nl | No
gc ; Nd                               ; Decimal_Number                   ; digit
gc ; Nl                               ; Letter_Number
gc ; No                               ; Other_Number
gc ; P                                ; Punctuation                      ; punct                            # Pc | Pd | Pe | Pf | Pi | Po | Ps
gc ; Pc                               ; Connector_Punctuation
gc ; Pd                               ; Dash_Punctuation
gc ; Pe                               ; Close_Punctuation
gc ; Pf                               ; Final_Punctuation
gc ; Pi                               ; Initial_Punctuation
gc ; Po                               ; Other_Punctuation
gc ; Ps                               ; Open_Punctuation
gc ; S                                ; Symbol                           # Sc | Sk | Sm | So
gc ; Sc                               ; Currency_Symbol
gc ; Sk                               ; Modifier_Symbol
gc ; Sm                               ; Math_Symbol
gc ; So                               ; Other_Symbol
gc ; Z                                ; Separator                        # Zl | Zp | Zs
gc ; Zl                               ; Line_Separator
gc ; Zp                               ; Paragraph_Separator
gc ; Zs                               ; Space_Separator
%

Activity

added this to the Proposal milestone on Dec 11, 2024

moved this to Incoming in Proposals on Dec 11, 2024
rsc

rsc commented on Jan 8, 2025

@rsc
Contributor, Author

I implemented this, and there are a few additions. The proposal is now:

  • Add CategoryAliases as described above, but with 38+4 = 42 entries: the 38 primary aliases plus the secondary aliases cntrl, Combining_Mark, digit, and punct, shown in the listing above.
  • Add a new LC table (var LC and a "LC": LC entry in Categories). This is a synthesized category (cased letter = Lu | Ll | Lt) that was missing before. I noticed because it has an alias, but the category itself did not exist in the first place.
  • Add a new Cn table (var Cn and a "Cn": Cn entry in Categories). This is also a synthesized category that has an alias but did not exist. It contains all unassigned code points (those with no category).
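To make the two synthesized categories concrete, the following sketch expresses them in terms of tables the unicode package has long exported. It is only an illustration under stated assumptions, not the implementation in the CLs: the helper names isCasedLetter and isUnassigned and the assigned slice are hypothetical. LC is treated as the union Lu | Ll | Lt, and Cn as every code point up to unicode.MaxRune that falls in none of the other two-letter categories.

```go
package main

import (
	"fmt"
	"unicode"
)

// assigned lists every two-letter general category other than Cn.
// (Hypothetical helper table for this sketch.)
var assigned = []*unicode.RangeTable{
	unicode.Lu, unicode.Ll, unicode.Lt, unicode.Lm, unicode.Lo,
	unicode.Mn, unicode.Mc, unicode.Me,
	unicode.Nd, unicode.Nl, unicode.No,
	unicode.Pc, unicode.Pd, unicode.Ps, unicode.Pe, unicode.Pi, unicode.Pf, unicode.Po,
	unicode.Sm, unicode.Sc, unicode.Sk, unicode.So,
	unicode.Zs, unicode.Zl, unicode.Zp,
	unicode.Cc, unicode.Cf, unicode.Cs, unicode.Co,
}

// isCasedLetter reports whether r is in LC, taken here as Lu | Ll | Lt.
func isCasedLetter(r rune) bool {
	return unicode.In(r, unicode.Lu, unicode.Ll, unicode.Lt)
}

// isUnassigned reports whether r is a valid code point that belongs to
// none of the assigned categories, which is how Cn is described above.
func isUnassigned(r rune) bool {
	return r >= 0 && r <= unicode.MaxRune && !unicode.In(r, assigned...)
}

func main() {
	fmt.Println(isCasedLetter('A'), isCasedLetter('\u01C5'), isCasedLetter('\u02B0'))
	fmt.Println(isUnassigned('A'), isUnassigned('\u0378'))
}
```

Checking membership this way scans many tables per rune; a dedicated precomputed table, as the proposal describes, avoids that cost and gives the alias something to point at.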
changed the title [-]proposal: unicode: add CategoryAliases[/-] [+]proposal: unicode: add CategoryAliases, LC, Cn[/+] on Jan 8, 2025
gopherbot

gopherbot commented on Jan 8, 2025

@gopherbot
Contributor

Change https://go.dev/cl/641395 mentions this issue: internal/export/unicode: add CategoryAliases, Cn, and LC

gopherbot

gopherbot commented on Jan 8, 2025

@gopherbot
Contributor

Change https://go.dev/cl/641376 mentions this issue: unicode: add CategoryAliases, Cn, LC

gopherbot

gopherbot commented on Jan 8, 2025

@gopherbot
Contributor

Change https://go.dev/cl/641377 mentions this issue: regexp/syntax: recognize category aliases like \p{Letter}

rsc

rsc commented on Feb 5, 2025

@rsc
Contributor, Author

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

moved this from Incoming to Active in Proposals on Feb 5, 2025
willfaught

willfaught commented on Feb 7, 2025

@willfaught
Contributor

Could there be any compatibility issues with new Unicode versions? Dropped or renamed or changed aliases?

Will regexp then use the map?

Edit: The changes to regexp are at #70781.

rsc

rsc commented on Feb 12, 2025

@rsc
Contributor, Author

In general, Unicode data is subject to change as Unicode changes. That said, I don't expect aliases to be deleted from the list. (We've seen them change the category of an individual code point in the past, but even that is rare.)


moved this from Likely Accept to Accepted in Proposals on Feb 26, 2025
changed the title [-]proposal: unicode: add CategoryAliases, LC, Cn[/-] [+]unicode: add CategoryAliases, LC, Cn[/+] on Feb 26, 2025
modified the milestones: Proposal, Backlog on Feb 26, 2025
added a commit that references this issue on Apr 18, 2025
28fd9fa
modified the milestones: Backlog, Go1.25 on Apr 18, 2025