Unicode property value abbreviated names and long names #60888

PanderMusubi mannequin opened this issue Dec 14, 2012 · 13 comments
Labels
3.9 (only security fixes), topic-unicode, type-feature (a feature request or enhancement)

Comments

@PanderMusubi
Mannequin

PanderMusubi mannequin commented Dec 14, 2012

BPO 16684
Nosy @loewis, @terryjreedy, @benjaminp, @ezio-melotti, @gnprice
Files
  • create-unicodedata-dicts-prop-value-alias-20121223.py: Create dictionaries for the unicodedata module containing property value aliases in terms of abbreviated names and long names.
  • bc_ea_gc.py: Refactored 3.3 version
  • prop-val-aliases.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2012-12-14.17:33:12.953>
    labels = ['type-feature', '3.9', 'expert-unicode']
    title = 'Unicode property value abbreviated names and long names'
    updated_at = <Date 2019-09-20.07:56:31.472>
    user = 'https://bugs.python.org/PanderMusubi'

    bugs.python.org fields:

    activity = <Date 2019-09-20.07:56:31.472>
    actor = 'Greg Price'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2012-12-14.17:33:12.953>
    creator = 'PanderMusubi'
    dependencies = []
    files = ['28405', '28411', '48616']
    hgrepos = []
    issue_num = 16684
    keywords = ['patch']
    message_count = 10.0
    messages = ['177476', '177479', '177510', '177909', '177985', '178018', '178142', '285270', '320089', '352840']
    nosy_count = 6.0
    nosy_names = ['loewis', 'terry.reedy', 'benjamin.peterson', 'ezio.melotti', 'PanderMusubi', 'Greg Price']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue16684'
    versions = ['Python 3.9']

    @PanderMusubi
    Mannequin Author

    PanderMusubi mannequin commented Dec 14, 2012

    The unicodedata module
    http://docs.python.org/3/library/unicodedata.html
    offers lookups of the general category, bidirectional class and East Asian width of Unicode characters:
    unicodedata.category(unichr)
    unicodedata.bidirectional(unichr)
    unicodedata.east_asian_width(chr)

    The abbreviated name of the specific category is returned. However, for certain applications it is important to be able to get from the abbreviated name to the long name and vice versa.

    The data needed to do this can be found at
    http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
    under sections
    # General_Category (gc)
    # Bidi_Class (bc)
    # East_Asian_Width (ea)
    Use only the second (abbreviated name) and third (long name) fields, ignoring the other fields and any comments.

    For general category, also support translating back and forth the one-letter abbreviations, which are groups representing the two-letter general category abbreviations that share the same initial letter.

    Please extend this module with a way of translating back and forth between abbreviated name and long name for the property values defined in Unicode for general category, bidirectional class and East Asian width. This functionality should be independent of retrieving the abbreviated names for Unicode characters, as is available now, and should be accessible via separate methods or dictionaries in which developers can perform lookups themselves.

    Implementing the functionality requested in this issue allows Python developers to get from an abbreviated property value to a meaningful property value name and vice versa without having to retrieve this information from the Unicode Consortium and/or shipping this information with their code with the risk of using outdated information.
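
A minimal sketch of the parsing step described above, assuming the semicolon-separated layout of PropertyValueAliases.txt with `#` starting a comment. The sample text is a short hand-copied excerpt in that layout, not the full file, and the names `parse_aliases`, `WANTED`, and `reverse` are hypothetical, not part of unicodedata.

```python
from collections import defaultdict

# A few lines in the layout of PropertyValueAliases.txt (assumed excerpt).
# Field 1 is the property, field 2 the abbreviated name, field 3 the long name.
SAMPLE = """\
# General_Category (gc)
gc ; C  ; Other            # Cc | Cf | Cn | Co | Cs
gc ; Nd ; Decimal_Number   ; digit
bc ; AL ; Arabic_Letter
bc ; WS ; White_Space
ea ; A  ; Ambiguous
ea ; W  ; Wide
ccc; 0  ; NR ; Not_Reordered
"""

# Only the three properties requested in this issue.
WANTED = {"gc", "bc", "ea"}


def parse_aliases(text, wanted=WANTED):
    """Return {property: {abbreviated_name: long_name}} for the wanted properties."""
    result = defaultdict(dict)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] not in wanted:
            continue
        prop, short, long_name = fields[:3]  # ignore any further aliases
        result[prop][short] = long_name
    return dict(result)


aliases = parse_aliases(SAMPLE)

# The reverse direction: long name back to abbreviated name.
reverse = {
    prop: {long_name: short for short, long_name in table.items()}
    for prop, table in aliases.items()
}
```

Because the one-letter group `C` is listed in the same section as the two-letter categories, it lands in the same table without special handling.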

    @PanderMusubi PanderMusubi mannequin added topic-unicode type-feature A feature request or enhancement labels Dec 14, 2012
    @ezio-melotti
    Member

    for certain applications it is important to be able to get from the
    abbreviated name to the long name and vice versa.

    What kind of application? I have a module where I defined my own dict that maps categories to their full names, but I'm not sure this feature is common enough that it should be included and maintained in the stdlib.

    If it's added, a dict is probably enough, but a script to parse the file you mentioned and update this dict should also be included.

    @PanderMusubi
    Mannequin Author

    PanderMusubi mannequin commented Dec 14, 2012

    I myself have a lot of Python applications that process font files and interact with fonttools and FontForge, which are both written in Python too. Since you have your own dict for this purpose, and other people probably do too, adding these three small dicts to the standard lib seems justified, especially since this module in the standard lib follows the definitions from the Unicode Consortium.

    When this is shipped in one package, developers will always have an in-sync translation from abbreviated names to long names and vice versa. Over the last years I have regularly needed to adjust my dicts for definitions added by the Unicode Consortium and supported by the stdlib.

    At the moment, translation from Unicode code points such as U+1234 to human-readable Unicode names and vice versa is already offered. Providing human-readable names for the property values is a service of the same level and will cater to approximately the same user group.

    If you agree that these dicts can be added I am willing to provide a script that will parse the aforementioned file.

    @terryjreedy
    Member

    This seems like a plausible request to me. The three dicts comprise 70 code-alias pairs. If unicodedata had a Python version (should it?), the simplest thing would be to add bididict, eawdict, and gcdict to that version (and not to the C version). I don't know how well putting dicts in C code works. A unicodealias module could be added but I do not really like that idea. I would prefer adding data attributes and corresponding docs to the current module.

    Pander: submitting a proof-of-concept script that accesses and parses that url and produces ready-to-go python code like below might encourage adoption of your proposal. In any case, it would be here for others to use.

    bididict = {
        'AL': 'Arabic_Letter',
    ...
        'WS': 'White_Space',
    }

    eawdict = ...
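
One way a script might emit "ready-to-go" source in the layout shown above, as a rough sketch. `render_dict` is a hypothetical helper, and the two-entry input stands in for the full table that a parser would produce.

```python
def render_dict(name, mapping):
    """Return Python source text assigning `mapping` to `name`, one pair per line."""
    lines = [f"{name} = {{"]
    for key in sorted(mapping):
        lines.append(f"    {key!r}: {mapping[key]!r},")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Stand-in data; a real run would pass the parsed Bidi_Class table.
bididict = {"WS": "White_Space", "AL": "Arabic_Letter"}
source = render_dict("bididict", bididict)

# Round-trip check: executing the generated text recreates the same dict.
namespace = {}
exec(source, namespace)
```

Sorting the keys keeps the generated file stable between runs, which makes diffs after a Unicode version bump easier to review.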

    @PanderMusubi
    Mannequin Author

    PanderMusubi mannequin commented Dec 23, 2012

    Attached is the requested proof-of-concept script.

    @terryjreedy
    Member

    I verified that the prototype file works in 2.7.3. I rewrote it for 3.3 using a refactored approach (and discovered that the site sometimes times out).

    @ezio-melotti
    Member

    The script should probably be integrated in Tools/unicode/makeunicodedata.py.

    @PanderMusubi
    Mannequin Author

    PanderMusubi mannequin commented Jan 11, 2017

    Any updates or ideas on how to move this forward? Meanwhile, should the issue relate to version 3.6? Thanks. Ah, see also https://bugs.python.org/issue6331 please

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Jan 11, 2017
    @PanderMusubi
    Mannequin Author

    PanderMusubi mannequin commented Jun 20, 2018

    Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward.

    @ned-deily ned-deily added 3.8 (EOL) end of life and removed 3.7 (EOL) end of life labels Jun 20, 2018
    @gnprice
    Contributor

    gnprice commented Sep 20, 2019

    I've gone and implemented a version of this that's integrated into Tools/unicode/makeunicodedata.py, and into the unicodedata module. Patch attached. Demo:

    >>> import unicodedata, pprint
    >>> pprint.pprint(unicodedata.property_value_aliases)
    {'bidirectional': {'AL': ['Arabic_Letter'],
    # ...
                       'WS': ['White_Space']},
     'category': {'C': ['Other'],
    # ...
     'east_asian_width': {'A': ['Ambiguous'],
    # ...
                          'W': ['Wide']}}

    Note that the values are lists. That's because a value can have multiple aliases in addition to its "short name":

    >>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
    ['Decimal_Number', 'digit']

    This implementation also provides the reverse mapping, from an alias to the "short name":

    >>> pprint.pprint(unicodedata.property_value_by_alias)
    {'bidirectional': {'Arabic_Letter': 'AL',
    # ...
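
For readers without the patch applied, the relationship between the two mappings can be sketched in pure Python. This assumes the shape shown in the demo (short name mapped to a list of aliases); the data below is a tiny hand-written stand-in, `invert_aliases` is a hypothetical helper, and neither attribute exists in a released unicodedata.

```python
# Stand-in for the patch's property_value_aliases (assumed shape, partial data).
property_value_aliases = {
    "category": {"Nd": ["Decimal_Number", "digit"], "C": ["Other"]},
    "bidirectional": {"AL": ["Arabic_Letter"], "WS": ["White_Space"]},
}


def invert_aliases(aliases):
    """Build {property: {alias: short_name}} from {property: {short_name: [aliases]}}."""
    by_alias = {}
    for prop, table in aliases.items():
        inverse = {}
        for short, names in table.items():
            for alias in names:
                # An alias pointing at two different short names would be ambiguous.
                if inverse.setdefault(alias, short) != short:
                    raise ValueError(f"alias {alias!r} is ambiguous in {prop!r}")
        by_alias[prop] = inverse
    return by_alias


property_value_by_alias = invert_aliases(property_value_aliases)
```

Every alias in a value's list maps back to the same short name, so both `'Decimal_Number'` and `'digit'` resolve to `'Nd'`.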

    This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:

    • This version is in C; at import time some C code builds up the dicts, from static tables in the header generated by makeunicodedata.py. It's not *that* much code... but it sure would be more convenient to do in Python instead.

      Should the unicodedata module perhaps have a Python part? I'd be happy to go about that -- rename the existing C module to _unicodedata and add a small unicodedata.py wrapper -- if there's a feeling that it'd be a good idea. Then this could go there instead of using the C code I've just written.

    • Is this API the right one?

      • This version has e.g. unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd'.

      • Perhaps make category/bidirectional/east_asian_width into attributes rather than keys? So e.g. unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd'.

      • Or: the standard says "loose matching" should be applied to these names, so e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but functions?

        So e.g. unicodedata.property_value_by_alias('decimal number') == unicodedata.property_value_by_alias('Decimal_Number') == 'Nd'.

      • There's also room for bikeshedding on the names.

    • How shall we handle ucd_3_2_0 for this feature?

      This implementation doesn't attempt to record the older version of the data. My reasoning is that because the applications of the old data are quite specific and they haven't needed this information yet, it seems unlikely anyone will ever really want to know from this module just which aliases existed already in 3.2.0 and which didn't yet.

      OTOH, as a convenience I've caused e.g. unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the same object as unicodedata.property_value_by_alias. This allows unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata module itself, while minimizing the complexity it adds to the implementation.

      Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's still easy to get at them -- just get them from the module itself -- and it makes it explicit that you're getting current rather than old data.
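
The function-based option with loose matching could look roughly like the sketch below. It assumes loose matching means ignoring case, whitespace, underscores, hyphens, and a leading "is", which covers the examples given in the list above. `loose_key` and `lookup_value` are hypothetical names, the table is a small stand-in, and taking the property name as an explicit first argument is a choice made for this sketch, not something the draft settles.

```python
import re

# Stand-in for the alias-to-short-name data (partial, hand-written).
_BY_ALIAS = {
    "category": {"Decimal_Number": "Nd", "digit": "Nd", "Other": "C"},
    "east_asian_width": {"Wide": "W", "Ambiguous": "A"},
}


def loose_key(name):
    """Normalize a name: drop whitespace, '_' and '-', lowercase, strip a leading 'is'."""
    key = re.sub(r"[\s_\-]+", "", name).lower()
    if key.startswith("is"):
        key = key[2:]
    return key


# Normalize the table keys once, with the same function used for queries,
# so both sides of the comparison are folded identically.
_LOOSE = {
    prop: {loose_key(alias): short for alias, short in table.items()}
    for prop, table in _BY_ALIAS.items()
}


def lookup_value(prop, name):
    """Return the short name for `name` under `prop`; raises KeyError if unknown."""
    return _LOOSE[prop][loose_key(name)]
```

With this, `lookup_value('category', 'decimal number')`, `lookup_value('category', 'is-decimal-number')`, and `lookup_value('category', 'Decimal_Number')` all resolve to the same entry, which a plain dict keyed on exact spellings cannot offer.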

    @gnprice gnprice added 3.9 only security fixes and removed 3.8 (EOL) end of life labels Sep 20, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @PanderMusubi

    For me, whether the implementation is in C or Python doesn't matter.

    I prefer unicodedata.property_value_by_alias.category over the other option.

    As for old data, I'm more interested in the latest version.

    Does this help? What do others think and can we move this issue forward?

    @PanderMusubi

    @gnprice @ezio-melotti @terryjreedy here is a polite reminder, after a year, to move this forward.

    @terryjreedy
    Member

    @isidentical @ambv @vstinner Is #96954, using a DAWG for unicodedatabase names, relevant to this issue?

    @gnprice If your bpo file prop-val-aliases.patch is still relevant, would you want to turn it into a GH PR, or would you rather someone else do it? I know that you are focused on zulip now.
