gh-96954: use a directed acyclic word graph for storing the unicodedata codepoint names #97906
Conversation
this makes ucd_3_2_0 correctly fail on looking up aliases
(except for places where the ints come from existing C-API functions)
I'm still in the process of reading through the algorithm and comparing the implementation here to it. I just wanted to get those nits out of the way since otherwise my mind would uselessly keep returning to them.
also add a DEBUG flag that checks the correctness of the packed representation at unicodedata build time, using the Python variants of the lookup/inverse_lookup algorithms
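A build-time check of that kind boils down to verifying a round-trip contract between the two lookup directions. An illustrative sketch of that contract (dict-based stand-ins, not the actual Tools/unicode code, which runs the Python variants of the packed-DAWG algorithms):

```python
# Illustrative sketch (not the actual Tools/unicode code) of what such a
# DEBUG-time check verifies: lookup and inverse_lookup must be exact
# inverses over every name, and invalid inputs must fail cleanly.
# Dict-based stand-ins model the contract of the packed-DAWG algorithms.
def self_check(names):
    ordered = sorted(names)
    lookup = {name: i for i, name in enumerate(ordered)}
    inverse = dict(enumerate(ordered))
    for i, name in enumerate(ordered):
        assert lookup[name] == i                 # name -> dense index
        assert inverse[i] == name                # index -> name round-trip
    assert "NOT A NAME" not in lookup            # unknown names are rejected
    assert len(names) not in inverse             # out-of-range index fails
    return True

assert self_check(["CIRCLED WZ", "CAT FACE", "CAT"])
```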
Thanks @ambv, all very sensible complaints, fixed them.
Modules/unicodedata.c
Outdated
assert(buflen >= 0);
return _inverse_dawg_lookup(buffer, (unsigned int)buflen, offset);
You should be able to use Py_SAFE_DOWNCAST here (int -> unsigned int).
It looks like there's a Lib/test/test_tools/ for testing scripts in the tools directory.
ah wonderful, thanks for finding that! and with the hypothesis stubs we can even do this properly :-)
now it's less a fixpoint and more an optimization process, i.e. we can stop at any point and simply get a less optimal but still correct result.
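The kind of compression being discussed can be illustrated with a small sketch (illustrative only, assuming the classic trie-to-DAWG construction; the actual Tools/unicode implementation differs): build a trie of the names, then merge identical subtrees bottom-up. Because each merge only replaces a subtree with an equivalent one, stopping early still yields a correct, just less compact, automaton.

```python
# A minimal sketch (not the CPython implementation) of the core idea:
# build a trie of the names, then merge identical subtrees bottom-up,
# turning the trie into a DAWG. Merging can stop at any point and still
# yield a correct (just less compact) automaton.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}          # empty-string key marks end of word
    return root

def merge(node, registry):
    # Recursively replace equivalent subtrees with a single shared node.
    for ch, child in node.items():
        node[ch] = merge(child, registry)
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(key, node)

words = ["CAT", "CATS", "FACT", "FACTS"]
dawg = merge(build_trie(words), {})
# "CATS" and "FACTS" now share their trailing "S" subtree:
assert dawg["C"]["A"]["T"]["S"] is dawg["F"]["A"]["C"]["T"]["S"]
```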
Co-authored-by: Pieter Eendebak <[email protected]>
@sweeneyde I've addressed your comments, I think. Would you have some time to review the C code as well?
Thanks, I'll take another look tonight or tomorrow.
A few more subjective things, but this LGTM!
I went ahead and tried adding versions of all of the packed-bytes C functions parameterized by the packed bytes instead of hard-coded, then haphazardly added thin Python wrappers unicodedata._dawg_lookup and unicodedata._dawg_inverse_lookup, and threw Hypothesis at them, and I didn't find any extra corner cases, so that is another good sign.
hypothesis code
```python
@given(st.sets(
    st.text("ABCD _", min_size=1, max_size=20),
    min_size=2, max_size=20)
)
def test_c_lookup(self, words0):
    words = list(words0)
    not_a_word = words.pop()
    words.sort()
    dawg = Dawg()
    for i, word in enumerate(words):
        dawg.insert(word, i * 10)
    packed, pos_to_code, reversedict = dawg.finish()
    word2pos = {}
    for word in words:
        word = word.encode("ascii")
        pos = c_lookup(packed, word)
        word2pos[word] = pos
    self.assertEqual(set(word2pos.values()), set(range(len(words))))
    for word, pos in word2pos.items():
        self.assertEqual(c_inverse_lookup(packed, pos), word)
    self.assertEqual(c_lookup(packed, not_a_word.encode("ascii")), -1)
    self.assertEqual(c_inverse_lookup(packed, len(words)), None)
    self.assertEqual(c_inverse_lookup(packed, len(words) + 1), None)
```
Co-authored-by: Dennis Sweeney <[email protected]>
- rename child_count to descendant_count
- rename final_edge to last_edge to reduce the confusion with "final states"
@sweeneyde thanks a lot for the thorough feedback! I adopted your suggestions, they made sense to me too. Also thanks for the extra hypothesis checks. Do you think it would make sense to push for including them in the test suite? It would be a little bit annoying, because we would have to pass the packed representation everywhere, as opposed to just referring to the single global one we need outside of tests.
The test failures look unrelated, maybe it's the same as #111644?
Up to you. I don't think it's totally necessary because, as you mentioned, all the code is already exercised by
Thanks for persevering, Carl Friedrich! ✨ 🍰 ✨ Also thanks for the review, @sweeneyde.
s390x failure looks unrelated:
I think it is related: the buildbots labelled "installed" seem to not get dawg.py. I opened #111764 to add skip_if_missing.
…codedata codepoint names (python#97906) Co-authored-by: Łukasz Langa <[email protected]> Co-authored-by: Pieter Eendebak <[email protected]> Co-authored-by: Dennis Sweeney <[email protected]>
gh-96954: use a directed acyclic word graph (also known as a deterministic acyclic finite state automaton, or finite state transducer) for storing the unicodedata codepoint names. This is the approach that PyPy recently switched to. The names are encoded into a packed string that represents the finite state machine used to recognize valid names and to map each name to an index. The packed representation can be used to match names without decompression. The same representation can be used for the inverse operation of mapping a codepoint to its name.
This change reduces the size of the unicodedata shared library from 1222 KiB to 791 KiB.

Relevant papers:
/CC @isidentical
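The name-to-index mapping described above can be sketched in pure Python. This is an illustrative model only, not the actual packed format: the real implementation stores precomputed per-edge counts in the packed byte string rather than recomputing them on every step, and operates on bytes, not nested dicts.

```python
# Illustrative model of DAWG-based name <-> index lookup. A word's index is
# the number of accepted words that are lexicographically smaller, computed
# by summing the sizes of the subtrees skipped along the path. (A real
# packed format would precompute these counts; here they are recomputed.)

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = None          # end-of-word marker; "" sorts first
    return root

def subtree_count(node):
    # Number of words accepted at or below this node.
    return sum(1 if ch == "" else subtree_count(child)
               for ch, child in node.items())

def lookup(root, word):
    # Map a word to its dense index in sorted order, or -1 if absent.
    index, node = 0, root
    for ch in word:
        for edge in sorted(node):
            if edge == ch:
                break
            # Skip the whole subtree of each lexicographically smaller edge.
            index += 1 if edge == "" else subtree_count(node[edge])
        else:
            return -1            # no matching outgoing edge
        node = node[edge]
    return index if "" in node else -1

def inverse_lookup(root, index):
    # Walk from an index back to the word it encodes; None if out of range.
    out, node = [], root
    while True:
        for edge in sorted(node):
            n = 1 if edge == "" else subtree_count(node[edge])
            if index < n:
                if edge == "":
                    return "".join(out)
                out.append(edge)
                node = node[edge]
                break
            index -= n
        else:
            return None          # index exceeds the number of words

names = ["CAT", "CATS", "FACT"]
root = build_trie(names)
assert lookup(root, "CATS") == 1            # dense index in sorted order
assert inverse_lookup(root, 1) == "CATS"    # and back again
assert lookup(root, "DOG") == -1            # unknown names are rejected
```

Because the same walk drives both directions, the one packed structure serves as recognizer, minimal perfect hash, and inverse table at once, which is what makes the size reduction possible.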