Unicode property value abbreviated names and long names #60888

PanderMusubi · 2012-12-14T17:33:13Z

BPO	16684
Nosy	@loewis, @terryjreedy, @benjaminp, @ezio-melotti, @gnprice
Files	create-unicodedata-dicts-prop-value-alias-20121223.py: Create dictionaries for unicodedata package containing property value aliases in terms of abbreviated names and long names. bc_ea_gc.py: Refactored 3.3 version. prop-val-aliases.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2012-12-14.17:33:12.953>
labels = ['type-feature', '3.9', 'expert-unicode']
title = 'Unicode property value abbreviated names and long names'
updated_at = <Date 2019-09-20.07:56:31.472>
user = 'https://bugs.python.org/PanderMusubi'

bugs.python.org fields:

activity = <Date 2019-09-20.07:56:31.472>
actor = 'Greg Price'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2012-12-14.17:33:12.953>
creator = 'PanderMusubi'
dependencies = []
files = ['28405', '28411', '48616']
hgrepos = []
issue_num = 16684
keywords = ['patch']
message_count = 10.0
messages = ['177476', '177479', '177510', '177909', '177985', '178018', '178142', '285270', '320089', '352840']
nosy_count = 6.0
nosy_names = ['loewis', 'terry.reedy', 'benjamin.peterson', 'ezio.melotti', 'PanderMusubi', 'Greg Price']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue16684'
versions = ['Python 3.9']

PanderMusubi · 2012-12-14T17:33:12Z

The unicodedata package
http://docs.python.org/3/library/unicodedata.html
supports looking up the general category, bidirectional class and East Asian width of a Unicode character:
unicodedata.category(unichr)
unicodedata.bidirectional(unichr)
unicodedata.east_asian_width(unichr)

Each function returns the abbreviated name of the property value. However, for certain applications it is important to be able to get from the abbreviated name to the long name and vice versa.
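
For illustration, here is a small sketch of what the three existing lookups return today. The sample characters are my own choice; only the abbreviated names come back, which is the gap this request is about.

```python
import unicodedata

# Latin capital A, digit four, Arabic letter alef, a CJK ideograph.
samples = ["A", "4", "\u0627", "\u4e2d"]

for ch in samples:
    # Each call returns only the abbreviated property value name.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.east_asian_width(ch),
    )
```

Nothing in the module maps an abbreviation such as 'Nd' or 'AL' back to its long name.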

The data needed to do this can be found at
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
under sections
# General_Category (gc)
# Bidi_Class (bc)
# East_Asian_Width (ea)
Use only the second (abbreviated name) and third (long name) fields, ignoring the other fields and any comments.

For general category, also support translation back and forth of the one-letter abbreviations, which are groups representing the two-letter general category abbreviations with the same initial letter.
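
A minimal sketch of the parsing described above, under some assumptions: the SAMPLE text is hand-written in the style of PropertyValueAliases.txt (semicolon-separated fields, `#` comments) rather than copied from the real file, and `parse_aliases` is a hypothetical helper name, not an existing API.

```python
# Hand-written sample in the style of PropertyValueAliases.txt (an assumption,
# not a verbatim excerpt of the real file).
SAMPLE = """\
# General_Category (gc)

gc ; C  ; Other            # Cc | Cf | Cn | Co | Cs
gc ; Lu ; Uppercase_Letter
gc ; Nd ; Decimal_Number   ; digit

# Bidi_Class (bc)

bc ; AL ; Arabic_Letter
bc ; WS ; White_Space

# East_Asian_Width (ea)

ea ; A  ; Ambiguous
ea ; W  ; Wide
"""


def parse_aliases(text, properties=("gc", "bc", "ea")):
    """Return {property: {abbreviated_name: long_name}} for the given properties."""
    result = {prop: {} for prop in properties}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] not in result:
            continue
        # Second field is the abbreviated name, third is the long name;
        # any further aliases on the line are ignored here.
        prop, short, long_name = fields[:3]
        result[prop][short] = long_name
    return result


aliases = parse_aliases(SAMPLE)
print(aliases["gc"]["Nd"])
```

The one-letter groups such as `C` appear as ordinary rows in the gc section, so the same loop picks them up without special handling.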

Please extend this package with a way of translating back and forth between abbreviated name and long name for property values defined in Unicode for general category, bidirectional class and East Asian width. This functionality should be independent of retrieving the abbreviated names for a Unicode character, as is available now, and should be accessible via separate methods or dictionaries in which developers can perform lookups themselves.

Implementing the functionality requested in this issue allows Python developers to get from an abbreviated property value to a meaningful property value name and vice versa without having to retrieve this information from the Unicode Consortium and/or shipping this information with their code with the risk of using outdated information.

ezio-melotti · 2012-12-14T17:54:03Z

> for certain applications it is important to be able to get from the
> abbreviated name to the long name and vice versa.

What kind of application? I have a module where I defined my own dict that maps categories to their full names, but I'm not sure this feature is common enough to be included and maintained in the stdlib.

If it's added, a dict is probably enough, but a script to parse the file you mentioned and update this dict should also be included.

PanderMusubi · 2012-12-14T21:20:01Z

I myself have a lot of Python applications that process font files and interact with fonttools and FontForge, which are both written in Python too. Since you have your own dict for this purpose, and other people probably do too, adding these three small dicts to the standard library would be justified, especially since this package in the standard library follows the definitions from the Unicode Consortium.

When this is shipped in one package, developers will always have an in-sync translation from abbreviated names to long names and vice versa. Over the last few years I have regularly had to adjust my dicts for definitions added by the Unicode Consortium that the stdlib already supports.

At the moment, translation from Unicode codes U+1234 to human-readable Unicode names and vice versa is already offered. Providing human-readable names for the property values is a service of the same level and would cater to approximately the same user group.

If you agree that these dicts can be added I am willing to provide a script that will parse the aforementioned file.

terryjreedy · 2012-12-21T23:32:07Z

This seems like a plausible request to me. The three dicts comprise 70 code-alias pairs. If unicodedata had a Python version (should it?), the simplest thing would be to add bididict, eawdict, and gcdict to that version (and not to the C version). I don't know how well putting dicts in C code works. A unicodealias module could be added but I do not really like that idea. I would prefer adding data attributes and corresponding docs to the current module.

Pander: submitting a proof-of-concept script that accesses and parses that url and produces ready-to-go python code like below might encourage adoption of your proposal. In any case, it would be here for others to use.

bididict = {
    'AL': 'Arabic_Letter',
...
    'WS': 'White_Space',
}

eawdict = ...
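
One way such a script could emit "ready-to-go" source in the layout above is sketched here. This is not the attached proof-of-concept; `dict_source` is a hypothetical helper, and the mapping passed in is a two-entry sample rather than the full data.

```python
def dict_source(name, mapping):
    """Return Python source text that assigns `mapping` to the variable `name`."""
    lines = [f"{name} = {{"]
    for short, long_name in sorted(mapping.items()):
        # repr() quotes and escapes the strings so the output is valid source.
        lines.append(f"    {short!r}: {long_name!r},")
    lines.append("}")
    return "\n".join(lines) + "\n"


source = dict_source("bididict", {"WS": "White_Space", "AL": "Arabic_Letter"})
print(source)
```

Sorting the keys keeps the generated file stable between runs, which makes diffs after a Unicode update easier to review.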

PanderMusubi · 2012-12-23T13:34:00Z

Attached is the requested proof-of-concept script.

terryjreedy · 2012-12-23T22:11:42Z

I verified that the prototype file works in 2.7.3. I rewrote it for 3.3 using a refactored approach (and discovered that the site sometimes times out).

ezio-melotti · 2012-12-25T15:48:19Z

The script should probably be integrated in Tools/unicode/makeunicodedata.py.

PanderMusubi · 2017-01-11T20:32:05Z

Any updates or ideas on how to move this forward? Meanwhile, should the issue relate to version 3.6? Thanks. Ah, see also https://bugs.python.org/issue6331 please

PanderMusubi · 2018-06-20T16:09:10Z

Unicode version 11.0 has been out since June 2018. Perhaps that could help move this forward.

gnprice · 2019-09-20T07:56:30Z

I've gone and implemented a version of this that's integrated into Tools/unicode/makeunicodedata.py, and into the unicodedata module. Patch attached. Demo:

>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
# ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
# ...
 'east_asian_width': {'A': ['Ambiguous'],
# ...
                      'W': ['Wide']}}

Note that the values are lists. That's because a value can have multiple aliases in addition to its "short name":

>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']

This implementation also provides the reverse mapping, from an alias to the "short name":

>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
# ...
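
To make the relationship between the two mappings concrete, here is a rough sketch of how a reverse mapping can be derived from list-valued aliases. The patch does this in C; the code below is only an illustration, `invert` is a hypothetical name, and the data is a hand-written sample in the shape shown in the demo.

```python
# Hand-written sample in the shape shown in the demo; not the patch's real data.
property_value_aliases = {
    "category": {"Nd": ["Decimal_Number", "digit"], "Lu": ["Uppercase_Letter"]},
    "bidirectional": {"AL": ["Arabic_Letter"], "WS": ["White_Space"]},
}


def invert(aliases_by_property):
    """Map every alias back to its short name, per property."""
    reverse = {}
    for prop, values in aliases_by_property.items():
        table = reverse.setdefault(prop, {})
        for short, aliases in values.items():
            for alias in aliases:
                table[alias] = short
    return reverse


property_value_by_alias = invert(property_value_aliases)
print(property_value_by_alias["category"]["digit"])
```

Because several aliases can point at one short name, the reverse direction stays a plain string-to-string dict even though the forward direction needs lists.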

This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:

This version is in C; at import time some C code builds up the dicts, from static tables in the header generated by makeunicodedata.py. It's not *that* much code... but it sure would be more convenient to do in Python instead.

Should the unicodedata module perhaps have a Python part? I'd be happy to go about that -- rename the existing C module to _unicodedata and add a small unicodedata.py wrapper -- if there's a feeling that it'd be a good idea. Then this could go there instead of using the C code I've just written.

Is this API the right one?
- This version has e.g. unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd'.
- Perhaps make category/bidirectional/east_asian_width into attributes rather than keys? So e.g. unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd'.
- Or: the standard says "loose matching" should be applied to these names, so e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but functions?

  So e.g. unicodedata.property_value_by_alias('decimal number') == unicodedata.property_value_by_alias('Decimal_Number') == 'Nd'.
- There's also room for bikeshedding on the names.
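
The attribute-style and function-style options above can be sketched in pure Python as follows. This is an illustration under assumptions, not the patch: the names `loose_key`, `build_index`, `category_by_alias` and `by_alias` are hypothetical, the data is a two-entry sample, and the loose matching follows my reading of the rule in UAX #44 (ignore case, whitespace, underscores, hyphens, and an initial "is").

```python
import re
from types import SimpleNamespace

# Two-entry sample; the real table would be generated from the Unicode data.
CATEGORY_ALIASES = {"Nd": ["Decimal_Number", "digit"], "Lu": ["Uppercase_Letter"]}


def loose_key(name):
    """Normalize a name: drop whitespace, '_' and '-', lowercase, strip leading 'is'."""
    key = re.sub(r"[\s_-]+", "", name).lower()
    return key[2:] if key.startswith("is") else key


def build_index(values):
    """Index every short name and alias by its loose key."""
    index = {}
    for short, aliases in values.items():
        for name in (short, *aliases):
            index[loose_key(name)] = short
    return index


_CATEGORY_INDEX = build_index(CATEGORY_ALIASES)


def category_by_alias(name):
    """Function-style lookup with loose matching; raises KeyError if unknown."""
    return _CATEGORY_INDEX[loose_key(name)]


# Attribute-style lookup: exact matching only, one dict per property.
by_alias = SimpleNamespace(
    category={
        alias: short
        for short, aliases in CATEGORY_ALIASES.items()
        for alias in aliases
    }
)

print(category_by_alias("is-decimal-number"), by_alias.category["Decimal_Number"])
```

The same normalization is applied to the stored names and to the query, so stripping a leading "is" stays consistent even for names that genuinely begin with those letters.
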

How shall we handle ucd_3_2_0 for this feature?

This implementation doesn't attempt to record the older version of the data. My reasoning is that because the applications of the old data are quite specific and they haven't needed this information yet, it seems unlikely anyone will ever really want to know from this module just which aliases existed already in 3.2.0 and which didn't yet.

OTOH, as a convenience I've caused e.g. unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the same object as unicodedata.property_value_by_alias. This allows unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata module itself, while minimizing the complexity it adds to the implementation.

Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's still easy to get at them -- just get them from the module itself -- and it makes it explicit that you're getting current rather than old data.
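
For readers unfamiliar with ucd_3_2_0, the short sketch below shows the "near drop-in" behaviour using only what exists in the module today. The sample character is my own choice: U+1F600 was added to Unicode well after version 3.2.0, so the two databases may report different categories for it.

```python
import unicodedata

old = unicodedata.ucd_3_2_0

# The old database reports its own version and offers the same lookup methods.
print(unicodedata.unidata_version, old.unidata_version)

ch = "\U0001F600"  # added to Unicode after 3.2.0
print(unicodedata.category(ch), old.category(ch))
print(unicodedata.east_asian_width("A"), old.east_asian_width("A"))
```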

PanderMusubi · 2022-12-20T23:44:21Z

For me, it doesn't matter whether the implementation is in C or Python.

I prefer unicodedata.property_value_by_alias.category over the other option.

As for old data, I'm more interested in the latest version.

Does this help? What do others think, and can we move this issue forward?

PanderMusubi · 2023-12-17T09:19:15Z

@gnprice @ezio-melotti @terryjreedy here is a polite reminder, a year on, to move this forward.

terryjreedy · 2023-12-17T19:07:14Z

@isidentical @ambv @vstinner Is #96954 (using a DAWG for unicodedatabase names) relevant to this issue?

@gnprice If your bpo file prop-val-aliases.patch is still relevant, would you want to turn it into a GH PR, or would you rather someone else do it? I know that you are focused on zulip now.

PanderMusubi mannequin added topic-unicode type-feature A feature request or enhancement labels Dec 14, 2012

serhiy-storchaka added the 3.7 (EOL) end of life label Jan 11, 2017

ned-deily added 3.8 (EOL) end of life and removed 3.7 (EOL) end of life labels Jun 20, 2018

gnprice added 3.9 only security fixes and removed 3.8 (EOL) end of life labels Sep 20, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

PanderMusubi mentioned this issue Dec 20, 2022

Add unicode script info to the unicode database #50580

Open
