Add unicode script info to the unicode database #50580

Open
doerwalter opened this issue Jun 23, 2009 · 19 comments
Labels
3.7 (EOL) end of life topic-unicode type-feature A feature request or enhancement

Comments

@doerwalter
Contributor

BPO 6331
Nosy @malemburg, @loewis, @doerwalter, @vstinner, @benjaminp, @ezio-melotti, @akitada, @berkerpeksag, @gnprice
Files
  • unicode-script.diff
  • unicode-script-2.diff
  • unicode-script-3.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2009-06-23.20:50:57.790>
    labels = ['type-feature', '3.7', 'expert-unicode']
    title = 'Add unicode script info to the unicode database'
    updated_at = <Date 2019-08-28.05:09:27.665>
    user = 'https://github.com/doerwalter'

    bugs.python.org fields:

    activity = <Date 2019-08-28.05:09:27.665>
    actor = 'Greg Price'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2009-06-23.20:50:57.790>
    creator = 'doerwalter'
    dependencies = []
    files = ['14348', '14356', '14418']
    hgrepos = []
    issue_num = 6331
    keywords = ['patch', 'needs review']
    message_count = 17.0
    messages = ['89642', '89647', '89671', '89675', '89701', '89973', '111040', '177469', '177506', '214204', '214633', '214636', '226266', '251214', '285269', '320090', '320092']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'loewis', 'doerwalter', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'akitada', 'berker.peksag', 'PanderMusubi', 'Elizacat', 'Cosimo Lupo', 'Denis Jacquerye', 'Greg Price']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue6331'
    versions = ['Python 3.7']

    @doerwalter
    Contributor Author

    This patch adds a function unicodedata.script() that returns information
    about the script of the Unicode character.

    @doerwalter doerwalter added topic-unicode type-feature A feature request or enhancement labels Jun 23, 2009
    @loewis
    Mannequin

    loewis mannequin commented Jun 23, 2009

    I think the patch is incorrect: the default value for the script
    property ought to be Unknown, not Common (despite UCD.html saying the
    contrary; see UTR#24 and Scripts.txt).

    I'm puzzled why you use a hard-coded list of script names. The set of
    scripts will certainly change across Unicode versions, and I think it
    would be better to learn the script names from Scripts.txt.
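A minimal sketch of what learning the script names from Scripts.txt could look like. It assumes the usual UCD line layout (`first..last ; Script # comment`); the sample lines, the regular expression, and the name `parse_scripts` are illustrative and not taken from the patch.

```python
import re

# A few lines in the layout used by the UCD file Scripts.txt (illustrative excerpt).
SAMPLE = """\
# comment lines start with '#'
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
0020          ; Common # Zs       SPACE
"""

# first code point, optional "..last", then the script name after the semicolon
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return (ranges, names); ranges holds (first, last, script) tuples."""
    ranges = []
    names = ["Unknown"]  # index 0 is the default for unlisted code points
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # comment or blank line
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        script = m.group(3)
        if script not in names:
            names.append(script)  # names are discovered, not hard-coded
        ranges.append((first, last, script))
    return ranges, names


ranges, names = parse_scripts(SAMPLE)
print(names)
```

With this approach the list of names follows whatever the data file contains, so a new Unicode version would not require editing a hard-coded list.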

    Out of curiosity: how does the addition of the script property affect
    the number of distinct database records, and the total size of the database?

    I think a common application would be lower-cased script names, for more
    efficient comparison; UCD has also changed the spelling of the script
    names over time (from being all-capital before). So I propose that
    a) two functions are provided: one with the original script names, and
    one with the lower-case script names
    b) keep cached versions of interned script name strings in separate
    arrays, to avoid PyString_FromString every time.

    I'm doubtful that script names need to be provided for old database
    versions, so I would be happy to not record the script for old versions,
    and raise an exception if somebody tries to get the script for an old
    database version - surely applications of the old database records won't
    be accessing the script property, anyway.

    @doerwalter
    Contributor Author

    Martin v. Löwis wrote:

    > I think the patch is incorrect: the default value for the script
    > property ought to be Unknown, not Common (despite UCD.html saying the
    > contrary; see UTR#24 and Scripts.txt).

    Fixed.

    > I'm puzzled why you use a hard-coded list of script names. The set of
    > scripts will certainly change across Unicode versions, and I think it
    > would be better to learn the script names from Scripts.txt.

    I hardcoded the list, because I saw no easy way to get the indexes
    consistent across both versions of the database.

    > Out of curiosity: how does the addition of the script property affect
    > the number of distinct database records, and the total size of the database?

    I'm not exactly sure how to measure this, but the length of
    _PyUnicode_Database_Records goes from 229 entries to 690 entries.

    If it's any help I can post the output of makeunicodedata.py.

    > I think a common application would be lower-cased script names, for more
    > efficient comparison; UCD has also changed the spelling of the script
    > names over time (from being all-capital before). So I propose that
    > a) two functions are provided: one with the original script names, and
    > one with the lower-case script names

    Is this really necessary if we only have one version of the database?

    > b) keep cached versions of interned script name strings in separate
    > arrays, to avoid PyString_FromString every time.

    Implemented.

    > I'm doubtful that script names need to be provided for old database
    > versions, so I would be happy to not record the script for old versions,
    > and raise an exception if somebody tries to get the script for an old
    > database version - surely applications of the old database records won't
    > be accessing the script property, anyway.

    OK, I've removed the script_changes info for the old database. (And with
    this change the list of script names is no longer hardcoded).

    Here's a new version of the patch (unicode-script-2.diff).

    @loewis
    Mannequin

    loewis mannequin commented Jun 24, 2009

    >> I'm puzzled why you use a hard-coded list of script names. The set of
    >> scripts will certainly change across Unicode versions, and I think it
    >> would be better to learn the script names from Scripts.txt.

    > I hardcoded the list, because I saw no easy way to get the indexes
    > consistent across both versions of the database.

    Couldn't you have a global cache, something like

    scripts = ['Unknown']  # index 0 is the default script

    def findscript(script):
        # Return the index of a script name, appending it on first use.
        try:
            return scripts.index(script)
        except ValueError:
            scripts.append(script)
            return len(scripts) - 1
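A variant of the cache above, sketched under the assumption that a dict is kept next to the list so that each lookup avoids a linear `list.index` scan. The class name `ScriptTable` and its method are hypothetical and do not come from the patch.

```python
class ScriptTable:
    """Assign stable small integers to script names in order of first use."""

    def __init__(self):
        self.names = ["Unknown"]      # index -> name, 0 is the default
        self._index = {"Unknown": 0}  # name -> index, for constant-time lookup

    def find(self, script):
        try:
            return self._index[script]
        except KeyError:
            idx = len(self.names)
            self.names.append(script)
            self._index[script] = idx
            return idx


table = ScriptTable()
for name in ["Latin", "Greek", "Latin", "Common"]:
    table.find(name)
print(table.names)
```

Because indexes are handed out in order of first use, the same table object would have to be shared by every database version that is processed if the indexes are meant to stay consistent between them.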

    >> Out of curiosity: how does the addition of the script property affect
    >> the number of distinct database records, and the total size of the database?

    > I'm not exactly sure how to measure this, but the length of
    > _PyUnicode_Database_Records goes from 229 entries to 690 entries.

    I think this needs to be fixed, then - we need to study why there are
    so many new records (e.g. what script contributes most new records),
    and then look for alternatives.

    One alternative could be to create a separate Trie for scripts.

    I'd also be curious if we can increase the homogeneity of scripts
    (i.e. produce longer runs of equal scripts) if we declare that
    unassigned code points have the script that corresponds to the block
    (i.e. the script that surrounding characters have), and then only
    change it to "Unknown" at lookup time if it's unassigned.
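A toy illustration of the homogeneity idea, not a measurement on real data. It assumes a per-code-point list in which `None` marks an unassigned code point, and it simplifies "the script of the block" to "the script of the preceding assigned character". All names here are hypothetical.

```python
from itertools import groupby


def count_runs(seq):
    """Number of maximal runs of equal consecutive values."""
    return sum(1 for _ in groupby(seq))


def fill_gaps(scripts):
    """Give each unassigned (None) slot the script of the preceding assigned one."""
    filled = []
    previous = "Unknown"
    for s in scripts:
        if s is None:
            filled.append(previous)
        else:
            filled.append(s)
            previous = s
    return filled


def lookup(filled, assigned, cp):
    """Report "Unknown" for unassigned code points only at lookup time."""
    return filled[cp] if assigned[cp] else "Unknown"


# Toy table: a mostly Greek stretch with unassigned holes, then one Coptic slot.
scripts = ["Greek", "Greek", None, "Greek", None, None, "Greek", "Coptic"]
assigned = [s is not None for s in scripts]

stored_as_unknown = [s or "Unknown" for s in scripts]
filled = fill_gaps(scripts)
print(count_runs(stored_as_unknown), count_runs(filled))
```

In this toy table the number of runs drops from 6 to 2, while `lookup` still reports "Unknown" for the holes. Whether real data shows a comparable reduction would have to be measured.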

    > If it's any help I can post the output of makeunicodedata.py.

    I'd be interested in "size unicodedata.so", and how it changes.
    Perhaps the actual size increase isn't that bad.

    >> a) two functions are provided: one with the original script names, and
    >> one with the lower-case script names

    > Is this really necessary if we only have one version of the database?

    I don't know what this will be used for, but one application is
    certainly regular expressions. So we need an efficient test whether
    the character is in the expected script or not. It would be bad if
    such a test would have to do a .lower() on each lookup.

    @doerwalter
    Contributor Author

    I was comparing apples and oranges: the 229 entries for the trunk were
    for a UCS2 build (the patched version was UCS4). With UCS4 there are
    317 entries for the trunk.

    size unicodedata.o gives:

    __TEXT   __DATA   __OBJC   others   dec      hex
    13622    587057   0        23811    624490   9876a

    for trunk

    and

    __TEXT   __DATA   __OBJC   others   dec      hex
    17769    588817   0        24454    631040   9a100

    for the patched version.

    @doerwalter
    Contributor Author

    Here is a new version that includes a new function scriptl() that
    returns the script name in lowercase.
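A pure-Python sketch of the two-function idea, assuming a tiny stand-in table instead of the real database. The names `script` and `scriptl` follow the ones used in this issue; the tables and the code point mapping are hypothetical. The lower-case names are computed once, so no `.lower()` call happens per lookup.

```python
import sys

# Script names as spelled in the data file, plus a lower-cased copy built once.
SCRIPT_NAMES = ("Unknown", "Common", "Latin", "Greek")
SCRIPT_NAMES_LOWER = tuple(sys.intern(name.lower()) for name in SCRIPT_NAMES)

# Toy code point -> script index map (hypothetical; stands in for the real database).
_SCRIPT_INDEX = {0x20: 1, 0x41: 2, 0x3B1: 3}


def script(ch):
    """Script name in its original spelling; unlisted characters are Unknown."""
    return SCRIPT_NAMES[_SCRIPT_INDEX.get(ord(ch), 0)]


def scriptl(ch):
    """Script name in lower case, taken from the precomputed table."""
    return SCRIPT_NAMES_LOWER[_SCRIPT_INDEX.get(ord(ch), 0)]


print(script("A"), scriptl("A"))
```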

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 21, 2010

    Could someone with unicode knowledge take this review on, given that comments have already been made and responded to?

    @PanderMusubi
    Mannequin

    PanderMusubi mannequin commented Dec 14, 2012

    Please, also consider reviewing functionality offered by:
    http://pypi.python.org/pypi/unicodescript/
    and
    http://pypi.python.org/pypi/unicodeblocks/
    which could be used to improve and extend the proposed patch.

    @PanderMusubi
    Mannequin

    PanderMusubi mannequin commented Dec 14, 2012

    The latest version of the respective sources can be found here:
    https://github.com/ConradIrwin/unicodescript
    and here:
    https://github.com/simukis/unicodeblocks

    @loewis
    Mannequin

    loewis mannequin commented Mar 20, 2014

    Pander: In what way would this extend or improve the current patch?

    @PanderMusubi
    Mannequin

    PanderMusubi mannequin commented Mar 23, 2014

    I see the patch supports Unicode scripts https://en.wikipedia.org/wiki/Script_%28Unicode%29 but I am also interested in support for Unicode blocks https://en.wikipedia.org/wiki/Unicode_block

    Code for support for the latter is at https://github.com/nagisa/unicodeblocks

    I could not quite make out whether the patch also supports Unicode blocks. If not, should that be requested in a separate issue?

    Furthermore, support for Unicode scripts and blocks should be updated each time a new version of the Unicode standard is published. Someone should check whether the latest patch needs to be updated to the latest version of Unicode, not only for this issue but for each release of Python.

    @loewis
    Mannequin

    loewis mannequin commented Mar 23, 2014

    Adding support for blocks should indeed go into a separate issue. Your code for that is not suitable, as it should integrate with the existing makeunicodedata.py script, which your code does not.

    And yes, we update (nearly) all data in Python automatically from the files provided by the Unicode Consortium.

    @Elizacat
    Mannequin

    Elizacat mannequin commented Sep 2, 2014

    > I think this needs to be fixed, then - we need to study why there are
    > so many new records (e.g. what script contributes most new records),
    > and then look for alternatives.

    The "Common" script appears to be very fragmented and may be the cause of the issues.

    > One alternative could be to create a separate Trie for scripts.

    Not having seen the one in C yet, I have one written in Python, custom-made for storing the script database, based on the general idea of a range tree. It stores ranges individually straight out of Scripts.txt. The general idea is you take the average of the lower and upper bounds of a given range (they can be equal). When searching, you compare the codepoint value to the average in the present node, and use that to find which direction to search the tree in.

    Without coalescing neighbouring ranges that are the same script, I have 1,606 nodes in the tree (for Unicode 7.0, which added a lot of scripts). After coalescing, there appear to be 806 nodes.
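A sketch of the coalescing step and of a range lookup. Note that this uses a binary search over a sorted list of range starts (`bisect`) rather than the midpoint-keyed tree described above, and that the input ranges are a toy table, not the contents of Scripts.txt. The function names are hypothetical.

```python
from bisect import bisect_right


def coalesce(ranges):
    """Merge touching ranges with the same script.

    `ranges` must be sorted (first, last, script) tuples that do not overlap.
    """
    merged = []
    for first, last, script in ranges:
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == first:
            merged[-1] = (merged[-1][0], last, script)  # extend the previous range
        else:
            merged.append((first, last, script))
    return merged


def make_lookup(ranges):
    """Build a lookup function over sorted, non-overlapping ranges."""
    starts = [first for first, _, _ in ranges]

    def lookup(cp):
        i = bisect_right(starts, cp) - 1  # last range starting at or before cp
        if i >= 0 and cp <= ranges[i][1]:
            return ranges[i][2]
        return "Unknown"  # cp falls in a gap of this table

    return lookup


# Toy input: three touching Common ranges and two separate Latin ranges.
RAW = [
    (0x20, 0x20, "Common"),
    (0x21, 0x23, "Common"),
    (0x24, 0x24, "Common"),
    (0x41, 0x5A, "Latin"),
    (0x61, 0x7A, "Latin"),
]
MERGED = coalesce(RAW)
lookup = make_lookup(MERGED)
print(len(RAW), len(MERGED), lookup(0x22), lookup(0x41))
```

The two Latin ranges stay separate because the toy table leaves a gap between them; only ranges that touch and share a script are merged.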

    If anyone cares, I'll be more than happy to post code for inspection.

    > I don't know what this will be used for, but one application is
    > certainly regular expressions. So we need an efficient test whether
    > the character is in the expected script or not. It would be bad if
    > such a test would have to do a .lower() on each lookup.

    This is actually required for restriction-level detection as described in Unicode TR39, for all levels of restriction above ASCII-only (http://www.unicode.org/reports/tr39/#Restriction_Level_Detection).

    @CosimoLupo
    Mannequin

    CosimoLupo mannequin commented Sep 21, 2015

    I would very much like a script() function to be added to the built-in unicodedata module.
    What's the current status of this issue?
    Thanks.

    Cosimo

    @PanderMusubi
    Mannequin

    PanderMusubi mannequin commented Jan 11, 2017

    Any updates or ideas on how to move this forward? See also https://bugs.python.org/issue16684 Thanks.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Jan 11, 2017
    @PanderMusubi
    Mannequin

    PanderMusubi mannequin commented Jun 20, 2018

    Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward.

    @vstinner
    Member

    > Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward.

    Python 3.7 has been upgraded to Unicode 11.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @PanderMusubi

    See also the migrated issue #60888

    @PanderMusubi

    Polite question to move this a step forward after a year. Thanks.
