gh-96954: use a directed acyclic word graph for storing the unicodedata codepoint names #97906
Conversation
this makes ucd_3_2_0 correctly fail on looking up aliases
(except for places where the ints come from existing C-API functions)
I'm still in the process of reading through the algorithm and comparing the implementation here to it. I just wanted to get those nits out of the way since otherwise my mind would uselessly keep returning to them.
also add a DEBUG flag that checks the correctness of the packed representation at unicodedata build time, using the Python variants of the lookup/inverse_lookup algorithms
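A build-time check of that kind boils down to verifying a round-trip contract between the two lookup directions. An illustrative sketch of that contract (dict-based stand-ins, not the actual Tools/unicode code, which runs the Python variants of the packed-DAWG algorithms):

```python
# Illustrative sketch (not the actual Tools/unicode code) of what such a
# DEBUG-time check verifies: lookup and inverse_lookup must be exact
# inverses over every name, and invalid inputs must fail cleanly.
# Dict-based stand-ins model the contract of the packed-DAWG algorithms.
def self_check(names):
    ordered = sorted(names)
    lookup = {name: i for i, name in enumerate(ordered)}
    inverse = dict(enumerate(ordered))
    for i, name in enumerate(ordered):
        assert lookup[name] == i                 # name -> dense index
        assert inverse[i] == name                # index -> name round-trip
    assert "NOT A NAME" not in lookup            # unknown names are rejected
    assert len(names) not in inverse             # out-of-range index fails
    return True

assert self_check(["CIRCLED WZ", "CAT FACE", "CAT"])
```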
Thanks @ambv, all very sensible complaints, fixed them.
Modules/unicodedata.c
Outdated
assert(buflen >= 0);
return _inverse_dawg_lookup(buffer, (unsigned int)buflen, offset);
You should be able to use Py_SAFE_DOWNCAST here (int -> unsigned int).
It looks like there's a Lib/test/test_tools/ for testing scripts in the tools directory.
ah wonderful, thanks for finding that! and with the hypothesis stubs we can even do this properly :-)
now it's less a fixpoint and more an optimization process, i.e. we can stop at any point and simply get a less optimal but still correct result.
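The kind of compression being discussed can be illustrated with a small sketch (illustrative only, assuming the classic trie-to-DAWG construction; the actual Tools/unicode implementation differs): build a trie of the names, then merge identical subtrees bottom-up. Because each merge only replaces a subtree with an equivalent one, stopping early still yields a correct, just less compact, automaton.

```python
# A minimal sketch (not the CPython implementation) of the core idea:
# build a trie of the names, then merge identical subtrees bottom-up,
# turning the trie into a DAWG. Merging can stop at any point and still
# yield a correct (just less compact) automaton.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}          # empty-string key marks end of word
    return root

def merge(node, registry):
    # Recursively replace equivalent subtrees with a single shared node.
    for ch, child in node.items():
        node[ch] = merge(child, registry)
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(key, node)

words = ["CAT", "CATS", "FACT", "FACTS"]
dawg = merge(build_trie(words), {})
# "CATS" and "FACTS" now share their trailing "S" subtree:
assert dawg["C"]["A"]["T"]["S"] is dawg["F"]["A"]["C"]["T"]["S"]
```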
Co-authored-by: Pieter Eendebak <[email protected]>
@sweeneyde I've addressed your comments, I think. Would you have some time to review the C code as well?
Thanks, I'll take another look tonight or tomorrow.
A few more subjective things, but this LGTM!
I went ahead and tried adding versions of all of the packed-bytes C functions parameterized by the packed bytes instead of hard-coded, then haphazardly added thin Python wrappers unicodedata._dawg_lookup and unicodedata._dawg_inverse_lookup, and threw Hypothesis at them, and I didn't find any extra corner cases, so that is another good sign.
hypothesis code
```python
@given(st.sets(
    st.text("ABCD _", min_size=1, max_size=20),
    min_size=2, max_size=20)
)
def test_c_lookup(self, words0):
    words = list(words0)
    not_a_word = words.pop()
    words.sort()
    dawg = Dawg()
    for i, word in enumerate(words):
        dawg.insert(word, i * 10)
    packed, pos_to_code, reversedict = dawg.finish()
    word2pos = {}
    for word in words:
        word = word.encode("ascii")
        pos = c_lookup(packed, word)
        word2pos[word] = pos
    self.assertEqual(set(word2pos.values()), set(range(len(words))))
    for word, pos in word2pos.items():
        self.assertEqual(c_inverse_lookup(packed, pos), word)
    self.assertEqual(c_lookup(packed, not_a_word.encode("ascii")), -1)
    self.assertEqual(c_inverse_lookup(packed, len(words)), None)
    self.assertEqual(c_inverse_lookup(packed, len(words) + 1), None)
```
Co-authored-by: Dennis Sweeney <[email protected]>
- rename child_count to descendant_count
- rename final_edge to last_edge to reduce the confusion with "final states"
@sweeneyde thanks a lot for the thorough feedback! I adopted your suggestions, they made sense to me too. Also thanks for the extra hypothesis checks. Do you think it would make sense to push for including them in the test suite? It would be a little bit annoying, because we would have to pass the packed representation everywhere, as opposed to just referring to the single global one we need outside of tests.
The test failures look unrelated, maybe it's the same as #111644?
Up to you. I don't think it's totally necessary because, as you mentioned, all the code is already exercised by
Thanks for persevering, Carl Friedrich! ✨ 🍰 ✨ Also thanks for the review, @sweeneyde.
s390x failure looks unrelated:
I think it is related: the buildbots labelled "installed" seem to not get dawg.py. I opened #111764 to add skip_if_missing.
…codedata codepoint names (python#97906) Co-authored-by: Łukasz Langa <[email protected]> Co-authored-by: Pieter Eendebak <[email protected]> Co-authored-by: Dennis Sweeney <[email protected]>
gh-96954: use a directed acyclic word graph (also known as a deterministic acyclic finite state automaton, or finite state transducer) for storing the unicodedata codepoint names. This is the approach that PyPy recently switched to. The names are encoded into a packed string that represents the finite state machine used to recognize valid names and to map each name to an index. The packed representation can be used to match names without decompression. The same representation can be used for the inverse operation of mapping a codepoint to its name.
This change reduces the size of the unicodedata shared library from 1222 KiB to 791 KiB.

Relevant papers:
/CC @isidentical
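The name-to-index mapping described above can be sketched in pure Python. This is an illustrative model only, not the actual packed format: the real implementation stores precomputed per-edge counts in the packed byte string rather than recomputing them on every step, and operates on bytes, not nested dicts.

```python
# Illustrative model of DAWG-based name <-> index lookup. A word's index is
# the number of accepted words that are lexicographically smaller, computed
# by summing the sizes of the subtrees skipped along the path. (A real
# packed format would precompute these counts; here they are recomputed.)

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = None          # end-of-word marker; "" sorts first
    return root

def subtree_count(node):
    # Number of words accepted at or below this node.
    return sum(1 if ch == "" else subtree_count(child)
               for ch, child in node.items())

def lookup(root, word):
    # Map a word to its dense index in sorted order, or -1 if absent.
    index, node = 0, root
    for ch in word:
        for edge in sorted(node):
            if edge == ch:
                break
            # Skip the whole subtree of each lexicographically smaller edge.
            index += 1 if edge == "" else subtree_count(node[edge])
        else:
            return -1            # no matching outgoing edge
        node = node[edge]
    return index if "" in node else -1

def inverse_lookup(root, index):
    # Walk from an index back to the word it encodes; None if out of range.
    out, node = [], root
    while True:
        for edge in sorted(node):
            n = 1 if edge == "" else subtree_count(node[edge])
            if index < n:
                if edge == "":
                    return "".join(out)
                out.append(edge)
                node = node[edge]
                break
            index -= n
        else:
            return None          # index exceeds the number of words

names = ["CAT", "CATS", "FACT"]
root = build_trie(names)
assert lookup(root, "CATS") == 1            # dense index in sorted order
assert inverse_lookup(root, 1) == "CATS"    # and back again
assert lookup(root, "DOG") == -1            # unknown names are rejected
```

Because the same walk drives both directions, the one packed structure serves as recognizer, minimal perfect hash, and inverse table at once, which is what makes the size reduction possible.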