Unicode property value abbreviated names and long names #60888
Comments
The unicodedata package returns the abbreviated name of the specific category. However, for certain applications it is important to be able to get from the abbreviated name to the long name and vice versa. The data needed to do this is available in a file published by the Unicode Consortium. For general category, please also support translation back and forth of the one-letter abbreviations, which are groups representing the two-letter general category abbreviations with the same initial letter. Please extend this package with a way of translating back and forth between abbreviated name and long name for the property values defined in Unicode for general category, bidirectional class and East Asian width. This functionality should be independent of retrieving the abbreviated names for a Unicode character, as is available now, and should be accessible via separate methods or dictionaries in which developers can perform lookups themselves. Implementing this functionality allows Python developers to get from an abbreviated property value to a meaningful property value name and vice versa without having to retrieve this information from the Unicode Consortium and/or ship it with their code, at the risk of using outdated information. |
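As a rough illustration of the one-letter groups mentioned above, the sketch below derives the group from the first letter of the two-letter abbreviation that `unicodedata.category()` already returns. The `GROUP_NAMES` dict and the `group_name` helper are hypothetical names written out by hand for this example; they are not part of the unicodedata module.

```python
import unicodedata

# Hypothetical, hand-written table of the one-letter general category groups.
GROUP_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def group_name(ch):
    """Return the long name of the one-letter group for a character.

    The group is taken from the first letter of the two-letter
    general category abbreviation (for example 'Nd' -> 'N').
    """
    return GROUP_NAMES[unicodedata.category(ch)[0]]


for ch in ("A", "4", " ", "+"):
    print(repr(ch), unicodedata.category(ch), group_name(ch))
```

A table like this still has to be kept in sync by hand, which is the maintenance burden the request is trying to remove.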
What kind of application? I have a module where I defined my own dict that maps categories to their full names, but I'm not sure this feature is common enough that it should be included and maintained in the stdlib. If it's added, a dict is probably enough, but a script to parse the file you mentioned and update this dict should also be included. |
I myself have a lot of Python applications that process font files and interact with fonttools and FontForge, which are both written in Python too. Since you have your own dict for this purpose, and probably other people do too, it would be justified to add these three small dicts to the standard library, especially since this package in the standard library follows the definitions from the Unicode Consortium. When this is shipped in one package, developers will always have an in-sync translation from abbreviated names to long names and vice versa. Over the last years I have regularly needed to adjust my dicts for definitions added by the Unicode Consortium that the stdlib supports. At the moment, translation from Unicode code points such as U+1234 to human-readable Unicode names and vice versa is offered. Providing human-readable names for the property values is a service of the same level and would cater to approximately the same user group. If you agree that these dicts can be added, I am willing to provide a script that will parse the aforementioned file. |
This seems like a plausible request to me. The three dicts comprise 70 code-alias pairs. If unicodedata had a Python version (should it?), the simplest thing would be to add bididict, eawdict, and gcdict to that version (and not to the C version). I don't know how well putting dicts in C code works. A unicodealias module could be added, but I do not really like that idea. I would prefer adding data attributes and corresponding docs to the current module. Pander: submitting a proof-of-concept script that accesses and parses that URL and produces ready-to-go Python code like the example below might encourage adoption of your proposal. In any case, it would be here for others to use.
bididict = {
    'AL': 'Arabic_Letter',
    ...
    'WS': 'White_Space',
}
eawdict = ... |
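A minimal sketch of such a generator follows. It assumes the data file uses semicolon-separated lines of the form `property ; short ; long [; more aliases]` with `#` comments, and that the property codes `bc`, `ea` and `gc` stand for bidirectional class, East Asian width and general category. The `SAMPLE` text is a small hand-written stand-in, not the real file, and `parse_aliases` is a hypothetical helper; this is not the script attached to this issue.

```python
# Hypothetical sample in the assumed input format; not the real data file.
SAMPLE = """\
# Comment lines and blank lines are skipped.

bc ; AL ; Arabic_Letter
bc ; WS ; White_Space
ea ; A  ; Ambiguous
ea ; W  ; Wide
gc ; C  ; Other            # Cc | Cf | Cn | Co | Cs
gc ; Nd ; Decimal_Number ; digit
sc ; Latn ; Latin
"""

# Assumed property codes mapped to the dict names proposed above.
WANTED = {"bc": "bididict", "ea": "eawdict", "gc": "gcdict"}


def parse_aliases(text):
    """Return {dict_name: {short_name: long_name}} for the wanted properties."""
    tables = {name: {} for name in WANTED.values()}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] not in WANTED:
            continue  # other properties are ignored
        prop, short, long_name = fields[:3]  # extra aliases are dropped here
        tables[WANTED[prop]][short] = long_name
    return tables


tables = parse_aliases(SAMPLE)
for name, table in tables.items():
    print(f"{name} = {table!r}")
```

This sketch keeps only the first long name per value. A value can carry more than one alias, so a fuller version would need lists as values.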
Attached is the requested proof-of-concept script. |
I verified that the prototype file works in 2.7.3. I rewrote it for 3.3 using a refactored approach (and discovered that the site sometimes times out). |
The script should probably be integrated in Tools/unicode/makeunicodedata.py. |
Any updates or ideas on how to move this forward? Meanwhile, should the issue relate to version 3.6? Thanks. Ah, see also https://bugs.python.org/issue6331 please |
Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward. |
I've gone and implemented a version of this that's integrated into Tools/unicode/makeunicodedata.py, and into the unicodedata module. Patch attached. Demo:
>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
                   # ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
              # ...
 'east_asian_width': {'A': ['Ambiguous'],
                      # ...
                      'W': ['Wide']}}
Note that the values are lists. That's because a value can have multiple aliases in addition to its "short name":
>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']
This implementation also provides the reverse mapping, from an alias to the "short name":
>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
                   # ...
This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:
|
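For readers without the patch applied, the relationship between the two mappings in the demo can be sketched in pure Python. The table below is a small hand-written subset used only for illustration, and the module-level names merely mirror the attribute names from the demo; they are assumptions, not attributes of the released unicodedata module.

```python
import unicodedata

# Hypothetical hand-written subset; the patch generates the full tables.
property_value_aliases = {
    "bidirectional": {"AL": ["Arabic_Letter"], "WS": ["White_Space"]},
    "category": {"C": ["Other"], "Nd": ["Decimal_Number", "digit"]},
    "east_asian_width": {"A": ["Ambiguous"], "W": ["Wide"]},
}

# Reverse mapping: every alias points back to its short name.
property_value_by_alias = {
    prop: {
        alias: short
        for short, aliases in values.items()
        for alias in aliases
    }
    for prop, values in property_value_aliases.items()
}

# Forward lookup, starting from what unicodedata already provides.
short = unicodedata.category("4")
long_names = property_value_aliases["category"][short]
print(short, long_names)

# Reverse lookup from either alias.
print(property_value_by_alias["category"]["Decimal_Number"])
print(property_value_by_alias["category"]["digit"])
```

Deriving the reverse mapping from the forward one keeps the two tables consistent by construction.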
For me, it doesn't matter whether the implementation is in C or Python. I prefer unicodedata.property_value_by_alias.category over the other option. As for old data, I'm more interested in the latest version. Does this help? What do others think, and can we move this issue forward? |
@gnprice @ezio-melotti @terryjreedy here a polite reminder after a year to move this forward. |
@isidentical @ambv @vstinner Is #96954, using a DAWG for unicodedatabase names, relevant to this issue? @gnprice If your bpo file prop-val-aliases.patch is still relevant, would you want to turn it into a GH PR, or would you rather someone else do it? I know that you are focused on Zulip now. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.