Add unicode script info to the unicode database #50580
This patch adds a function unicodedata.script() that returns script information for a character.
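For illustration, a lookup along these lines can be sketched in pure Python. The ranges below are a tiny hand-copied excerpt of Scripts.txt used only so the sketch runs on its own; the actual patch generates its table from the full Unicode database, so the real data layout and function body will differ.

```python
import bisect

# Hypothetical miniature excerpt of Scripts.txt: (start, end, script name).
_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0370, 0x0373, "Greek"),
    (0x0400, 0x0484, "Cyrillic"),
    (0x05D0, 0x05EA, "Hebrew"),
]
_STARTS = [r[0] for r in _RANGES]

def script(ch, default="Unknown"):
    """Return the script name for character *ch* (sketch, not the real API)."""
    cp = ord(ch)
    # Find the last range starting at or before cp, then check its upper bound.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and _RANGES[i][0] <= cp <= _RANGES[i][1]:
        return _RANGES[i][2]
    return default

print(script("A"))   # Latin
print(script("д"))   # Cyrillic
```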
I think the patch is incorrect: the default value for the script …

I'm puzzled why you use a hard-coded list of script names. The set of …

Out of curiosity: how does the addition of the script property affect …

I think a common application would be lower-cased script names, for more …

I'm doubtful that script names need to be provided for old database …
Martin v. Löwis wrote:
Fixed.
I hardcoded the list, because I saw no easy way to get the indexes
I'm not exactly sure how to measure this, but the length of …

If it's any help, I can post the output of makeunicodedata.py.
Is this really necessary, if we only have one version of the database?
Implemented.
OK, I've removed the script_changes info for the old database. Here's a new version of the patch (unicode-script-2.diff).
Couldn't you have a global cache? Something like:

```python
scripts = ['Unknown']

def findscript(script):
    try:
        return scripts.index(script)
    except ValueError:
        scripts.append(script)
        return len(scripts) - 1
```
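Exercising that interning idea on its own (a self-contained restatement of the snippet above, with a short demo appended):

```python
scripts = ['Unknown']

def findscript(script):
    """Return a stable index for *script*, appending it on first sight."""
    try:
        return scripts.index(script)
    except ValueError:
        scripts.append(script)
        return len(scripts) - 1

# Each distinct name gets the next free slot; repeats reuse their slot.
print(findscript('Latin'))   # 1
print(findscript('Greek'))   # 2
print(findscript('Latin'))   # 1
print(scripts)               # ['Unknown', 'Latin', 'Greek']
```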
I think this needs to be fixed, then: we need to study why there are …

One alternative could be to create a separate Trie for scripts.

I'd also be curious if we can increase the homogeneity of scripts …
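The trie-like alternative mentioned above can be sketched as a two-level table: a first-level index of chunk ids plus a deduplicated chunk store. This is only an illustration of the layout, under the assumption of a fixed 128-entry chunk size; the real generator would tune the shift for best compression.

```python
SHIFT = 7          # 128 codepoints per chunk (an assumed, untuned value)
CHUNK = 1 << SHIFT

def build_two_level(values, default=0):
    """Compress a flat per-codepoint list into (index1, index2) tables."""
    chunks, index1 = {}, []
    for start in range(0, len(values), CHUNK):
        chunk = tuple(values[start:start + CHUNK])
        chunk += (default,) * (CHUNK - len(chunk))   # pad the last chunk
        # setdefault hands a fresh id to each previously unseen chunk.
        index1.append(chunks.setdefault(chunk, len(chunks)))
    index2 = [v for chunk, _ in sorted(chunks.items(), key=lambda kv: kv[1])
              for v in chunk]
    return index1, index2

def lookup(index1, index2, cp):
    """Two-level lookup: high bits pick the chunk, low bits index into it."""
    return index2[index1[cp >> SHIFT] * CHUNK + (cp & (CHUNK - 1))]

# 512 per-codepoint values collapse to a 3-entry index plus 2 unique chunks.
index1, index2 = build_two_level([0] * 256 + [1] * 128)
print(lookup(index1, index2, 300))   # 1
print(lookup(index1, index2, 5))     # 0
```

Homogeneous stretches of the same script compress into shared chunks, which is why the homogeneity question above matters for table size.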
I'd be interested in "size unicodedata.so", and how it changes.
I don't know what this will be used for, but one application is …
I was comparing apples and oranges: the 229 entries were for the trunk. `size unicodedata.o` gives __TEXT, __DATA, __OBJC, others, dec, and hex figures for the trunk and for the patched version.
Here is a new version that includes a new function scriptl() that …
Could someone with unicode knowledge take this review on, given that comments have already been made and responded to?

Please also consider reviewing the functionality offered by: …

The latest version of the respective sources can be found here: …

Pander: In what way would this extend or improve the current patch?
I see the patch supports Unicode scripts (https://en.wikipedia.org/wiki/Script_%28Unicode%29), but I am also interested in support for Unicode blocks (https://en.wikipedia.org/wiki/Unicode_block). Code supporting the latter is at https://github.com/nagisa/unicodeblocks. I could not quite make out whether the patch also supports Unicode blocks; if not, should that be requested in a separate issue? Furthermore, support for Unicode scripts and blocks should be updated each time a new version of the Unicode standard is published. Someone should check whether the latest patch needs updating to the latest version of Unicode, not only for this issue but for each release of Python.
Adding support for blocks should indeed go into a separate issue. Your code for that is not suitable as-is, since it does not integrate with the existing makeunicodedata.py script. And yes, we automatically update (nearly) all data in Python from the files provided by the Unicode consortium.
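Should blocks get their own issue, the lookup side is simpler than for scripts, because Blocks.txt assigns exactly one contiguous range per block. A bisect-based sketch, with a tiny hand-copied table standing in for one generated by makeunicodedata.py:

```python
import bisect

# Hand-copied sample of Blocks.txt entries: (first, last, block name).
_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]
_STARTS = [b[0] for b in _BLOCKS]

def block(ch):
    """Return the Unicode block name for *ch*, or the standard No_Block."""
    cp = ord(ch)
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= _BLOCKS[i][1]:
        return _BLOCKS[i][2]
    return "No_Block"

print(block("Ω"))   # Greek and Coptic
print(block("A"))   # Basic Latin
```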
The "Common" script appears to be very fragmented and may be the cause of the issues.
Not having seen the one in C yet, I have one written in Python, custom-made for storing the script database, based on the general idea of a range tree. It stores ranges individually, straight out of Scripts.txt.

The general idea is that you take the average of the lower and upper bounds of a given range (they can be equal). When searching, you compare the codepoint value to the average in the present node and use that to decide which direction to search the tree in.

Without coalescing neighbouring ranges that are the same script, I have 1,606 nodes in the tree (for Unicode 7.0, which added a lot of scripts). After coalescing, there appear to be 806 nodes. If anyone cares, I'll be more than happy to post code for inspection.
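A balanced variant of that range-tree idea can be sketched as follows. Building the tree in one pass from the sorted, non-overlapping ranges (rather than inserting them individually and comparing against range midpoints) is an assumption made here for brevity; the search logic is equivalent for non-overlapping ranges.

```python
def build_tree(ranges):
    """Build a balanced range tree from sorted non-overlapping
    (lo, hi, script) triples, e.g. parsed straight out of Scripts.txt.
    Each node is a (lo, hi, script, left, right) tuple."""
    if not ranges:
        return None
    mid = len(ranges) // 2
    lo, hi, name = ranges[mid]
    return (lo, hi, name, build_tree(ranges[:mid]), build_tree(ranges[mid + 1:]))

def find(node, cp, default="Unknown"):
    """Walk the tree comparing *cp* against each node's bounds."""
    while node is not None:
        lo, hi, name, left, right = node
        if cp < lo:
            node = left
        elif cp > hi:
            node = right
        else:
            return name
    return default

tree = build_tree([(0x41, 0x5A, "Latin"),
                   (0x370, 0x373, "Greek"),
                   (0x400, 0x484, "Cyrillic")])
print(find(tree, ord("д")))   # Cyrillic
```

Coalescing adjacent ranges with the same script before building, as described above, halves the node count without changing the search.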
This is actually required for restriction-level detection as described in Unicode TR39, for all levels of restriction above ASCII-only (http://www.unicode.org/reports/tr39/#Restriction_Level_Detection).
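The single-script part of that detection reduces to collecting the script of each character while ignoring Common and Inherited. A rough sketch on top of a script() lookup like the one proposed here; the tiny stand-in table is hypothetical so the sketch runs on its own:

```python
# Stand-in for the proposed unicodedata.script(); covers only the demo text.
_SAMPLE = {"a": "Latin", "p": "Latin", "l": "Latin", "e": "Latin",
           "р": "Cyrillic", "а": "Cyrillic", " ": "Common"}

def script(ch):
    return _SAMPLE.get(ch, "Unknown")

def is_single_script(text):
    """Rough single-script test in the spirit of UTS #39: Common/Inherited
    characters match anything; all remaining characters must share a script."""
    seen = {script(ch) for ch in text} - {"Common", "Inherited"}
    return len(seen) <= 1

print(is_single_script("apple"))   # True
print(is_single_script("аpple"))   # False: Cyrillic 'а' mixed with Latin
```

This is exactly the kind of spoofing check (e.g. "аpple" with a Cyrillic first letter) that TR39 restriction levels are designed to catch.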
I would very much like a …

Cosimo
Any updates or ideas on how to move this forward? See also https://bugs.python.org/issue16684. Thanks.
As of June 2018, Unicode version 11.0 is out. Perhaps that could help move this forward.
Python 3.7 has been upgraded to Unicode 11.
See also the migrated issue #60888 |
A polite nudge to move this a step forward, a year on. Thanks.