regexp/syntax: recognize Unicode category aliases #70781

Closed
@rsc

The Unicode specification defines aliases for some of the general category names. For example, the category "L" has the alias "Letter".

The regexp package supports \p{L} but not \p{Letter}, because there is nothing in the Unicode tables that lets regexp know about Letter.

Package regexp would be a permitted implementation for use in JSON-API Schema implementations, except that the relevant tests use aliases such as \p{Letter} instead of \p{L}.

In #70780 I proposed adding a new CategoryAliases table to package unicode.

If that is accepted, I propose to also recognize the category aliases in regexp/syntax, which will make them work in package regexp.

I also propose to follow https://unicode.org/reports/tr18/#General_Category_Property and add \p{Any}, \p{Assigned}, and \p{ASCII}.
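For reference, UTS #18 defines Any as every code point from U+0000 through U+10FFFF, ASCII as U+0000 through U+007F, and Assigned as the complement of the unassigned category Cn. A rough sketch of how the first two can be spelled with explicit ranges in the existing syntax follows; the constant and function names are assumptions made for this example, and Assigned is left out because it needs a Cn table.

```go
package main

import (
	"fmt"
	"regexp"
)

// Explicit-range spellings of two of the proposed classes.
// The names anyClass and asciiClass are hypothetical.
const (
	anyClass   = `[\x{0}-\x{10FFFF}]` // what \p{Any} would match
	asciiClass = `[\x00-\x7F]`        // what \p{ASCII} would match
)

var (
	asciiOnly = regexp.MustCompile(`^` + asciiClass + `*$`)
	oneRune   = regexp.MustCompile(`^` + anyClass + `$`)
)

// isASCII reports whether s consists solely of ASCII code points.
func isASCII(s string) bool { return asciiOnly.MatchString(s) }

// isSingleRune reports whether s is exactly one code point.
func isSingleRune(s string) bool { return oneRune.MatchString(s) }

func main() {
	fmt.Println(isASCII("hello"))   // true
	fmt.Println(isASCII("héllo"))   // false
	fmt.Println(isSingleRune("世"))  // true
	fmt.Println(isSingleRune("世界")) // false
}
```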

Finally, I propose to make the Unicode names case-insensitive, so that \p{ascii} can be used instead of \p{ASCII}.


    Status

    Accepted
