Skip to content

bpo-35859: fix bugs in re engine #12427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from Mar 29, 2022
Merged

bpo-35859: fix bugs in re engine #12427

merged 24 commits into from Mar 29, 2022

Conversation

ghost
Copy link

@ghost ghost commented Mar 19, 2019

For detailed explanation, please view each commit message in this PR.

Suggested overall commit message:

1, Fix macro MARK_PUSH() bug.

    MARK_PUSH() didn't protect MARK-0 if it was the only available MARK.
    This bug was reported in issue35859.

2, Add some MARK protections shown in below table:

    Legend of the table:
    * No: no protection
    * L : LASTMARK_SAVE()
    * MU: MARK_PUSH() unconditionally
    * MR: MARK_PUSH() if in a repeat

                       unpatched  patched
    JUMP_MAX_UNTIL_1     No         No      # repeat body, to min limit
    JUMP_MAX_UNTIL_2     L/MU       L/MU    # repeat body, until max limit
    JUMP_MAX_UNTIL_3     No         No      # backtrack point

    JUMP_MIN_UNTIL_1     No         No      # repeat body, to min limit
    JUMP_MIN_UNTIL_2     L     ->   L/MR    # backtrack point
    JUMP_MIN_UNTIL_3     No         No      # repeat body, until max limit

    JUMP_POSS_REPEAT_1   No         No      # repeat body, to min limit
    JUMP_POSS_REPEAT_2   L/MU       L/MU    # repeat body, until max limit

    JUMP_REPEAT_ONE_1    L     ->   L/MR    # backtrack point
    JUMP_REPEAT_ONE_2    L     ->   L/MR    # backtrack point
    JUMP_MIN_REPEAT_ONE  L     ->   L/MR    # backtrack point
    JUMP_BRANCH          L/MR       L/MR    # backtrack point

    JUMP_ASSERT          No         No      # sub-pattern
    JUMP_ASSERT_NOT      No    ->   L/MR    # sub-pattern

    Fix bugs reported in issue725149, issue9134, issue35859.

https://bugs.python.org/issue35859

wjssz added 12 commits March 19, 2019 11:15
If no error occurs, when the top level call returns:

- the stack should be empty
- state->repeat should be NULL

This prevents the following commits to wreck the stack.
MARK_PUSH() didn't protect MARK0 if it was the only available mark.

This bug was reported in issue35859.
JUMP_MAX_UNTIL_3 and JUMP_MIN_UNTIL_2 are tails of a repeat.
Before the JUMPs, sre sets state->repeat to the tail's repeat object:

    state->repeat = ctx->u.rep->prev;
    DO_JUMP(JUMP_MAX_UNTIL_3, jump_max_until_3, ctx->pattern);

But after the JUMPs, it doesn't restore state->repeat if the JUMP is SUCCESS.
This won't cause a bug, because SRE_OP_REPEAT will restore state->repeat later.
But it messes up state->repeat in some levels of JUMPs.

This behavior exists since commit 29c4ba9 (2000-Aug-2), a very immature period of sre.
Now fix like this:

    state->repeat = ctx->u.rep->prev;
    DO_JUMP(JUMP_MAX_UNTIL_3, jump_max_until_3, ctx->pattern);
+   state->repeat = ctx->u.rep;
    RETURN_ON_SUCCESS(ret);
-   state->repeat = ctx->u.rep;

Then backtrack points don't need to save state->repeat to ctx->u.rep.
Otherwise, if backtrack point is inside a repeat body, after a JUMP, state->repeat may points to the tail's repeat object (NULL or different repeat object).
These bugs are due to the lack of protection before JUMPs.
This PR will add some protections shown in below table.

Legend of the table:
* No: no protection
* L : LASTMARK_SAVE()
* MU: MARK_PUSH() unconditionally
* MR: MARK_PUSH() if in a repeat

                   unpatched  patched
JUMP_MAX_UNTIL_1     No         No      # repeat body, to min limit
JUMP_MAX_UNTIL_2     L/MU       L/MU    # repeat body, until max limit
JUMP_MAX_UNTIL_3     No         No      # backtrack point

JUMP_MIN_UNTIL_1     No         No      # repeat body, to min limit
JUMP_MIN_UNTIL_2     L     ->   L/MR    # backtrack point
JUMP_MIN_UNTIL_3     No         No      # repeat body, until max limit

JUMP_REPEAT_ONE_1    L     ->   L/MR    # backtrack point
JUMP_REPEAT_ONE_2    L     ->   L/MR    # backtrack point
JUMP_MIN_REPEAT_ONE  L     ->   L/MR    # backtrack point
JUMP_BRANCH          L/MR       L/MR    # backtrack point

JUMP_ASSERT          No         No      # sub-pattern
JUMP_ASSERT_NOT      No    ->   L/MR    # sub-pattern

🔼 It's clear that every backtrack points should L/MR.

Commit 4/5/6 will add L/MR to backtrack points respectively.

JUMP_MAX_UNTIL_3 is a backtrack point as well, but this one doesn't need L/MR protection.
See this picture's explanation: https://github.com/raw/animalize/pics/master/issues/max_until.jpg

🔼 JUMP_ASSERT_NOT's situation will be explained in Commit 7.

🔼 It seems other JUMPs don't need protect.

If these JUMPs fail, either an entire match failure or can be restored by backtrack points.
- JUMP_MAX_UNTIL_1
- JUMP_MIN_UNTIL_1
- JUMP_MIN_UNTIL_3
- JUMP_ASSERT

I really can't find a test-case to break them.
If I missed something, hope Commit 11 will catch them.

🔼 Test test-cases in this change with major regex engines:

Case-A:
s = 'axxzbcz'
p = r'(?:(?:a|bc)*?(xx)??z)*'
re.match(p, s).groups()

Case-B:
s = 'xtcxyzxc'
p = r'((x|yz)+?(t)??c)*'
re.match(p, s).groups()

                  Case-A     Case-B
PHP 7.3.2         "xx"       "xyzxc","x","t"
Java 11.0.2       "xx"       "xyzxc","x","t"
Perl 5.28.1       undef [1]  "xyzxc","x",undef [1]
Ruby 2.6.1        "xx"       "xyzxc","x","t"
Go 1.12           "xx"       "xyzxc","x","t"
Rust 1.32.0       "xx"       "xyzxc","x","t"
Node.js 10.15.1   undef [1]  "xyzxc","x",undef [1]
regex 2019.3.12   "xx"       "xyzxc","x","t"
re before patch   "" [2]     "xyzxc","x","" [2]
re after patch    "xx"       "xyzxc","x","t"

[1] different policy, not bug.
[2] "" is empty str, not None.
Case-A:
s = 'aabaab'
p = r'(?:[^b]*a(?=(b)|(a))ab)*'
re.match(p, s).groups()

Case-B:
s = 'abab'
p = r'(?:[^b]*(?=(b)|(a))ab)*'
re.match(p, s).groups()

Case-C:
s = 'ab'
p = r'(ab?)*?b'
re.match(p, s).groups()

                  Case-A      Case-B      Case-C
PHP 7.3.2         NULL,"a"    NULL,"a"    "a"
Java 11.0.2       "b","a" [1] "b","a" [1] "a"
Perl 5.28.1       "b","a"     "b","a"     "a"
Ruby 2.6.1        nil,"a"     nil,"a"     "a"
Go 1.12           [2]         [2]         "a"
Rust 1.32.0       [2]         [2]         "a"
Node.js 10.15.1   undef,"a"   undef,"a"   "a"
regex 2019.3.12   None,"a"    None,"a"    "a"
re before patch   "b","a"     "b","a"     "" [3]
re after patch    None,"a"    None,"a"    "a"

[1] seems this bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7145888
[2] doesn't support lookaround yet.
[3] "" is empty str, not None.
Case-A:
s = 'abab'
p = r'(?:.*?(?=(a)|(b))b)*'
re.match(p, s).groups()

Case-B:
s = 'axxzaz'
p = r'(?:a*?(xx)??z)*'
re.match(p, s).groups()

                  Case-A      Case-B
PHP 7.3.2         NULL,"b"    "xx"
Java 11.0.2       "a","b" [1] "xx"
Perl 5.28.1       "a","b"     undef [2]
Ruby 2.6.1        nil,"b"     "xx"
Go 1.12           [3]         "xx"
Rust 1.32.0       [3]         "xx"
Node.js 10.15.1   undef,"b"   undef
regex 2019.3.12   None,"b"    "xx"
re before patch   "a","b"     "" [4]
re after patch    None,"b"    "xx"

[1] seems this bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7145888
[2] different policy, not bug.
[3] doesn't support lookaround yet.
[4] "" is empty str, not None.
JUMP_ASSERT_NOT is special, when the sub-pattern fails, the match succeeds.
Here is a detailed explanation came from Greg Chapman:

> Greg Chapman wrote in issue725149's patch:
>
> For negative assertions, if the subpattern fails, the assertion
> succeeds and we want to continue matching.  However, in failing,
> the subpattern may have left any capturing groups it contains
> in an inconsistent (partially matched) state.  For consistency,
> "unmark" all groups in the subpattern after it completes by
> restoring marks to the state they had before executing the
> subpattern.  When not in a repeat, it is sufficient simply to
> restore lastmark to its previous value (since a repeat is the
> only way to move from right to left in a pattern, without a repeat
> marks in the assertion will be greater than any so far
> encountered.

This bug was initially reported in issue725149.

Case-A:
re.match(r'(?!(..)c)', 'ab').groups()

Case-B:
re.match(r'(?:(?!(ab)c).)*', 'ab').groups()

                  Case-A    Case-B
PHP 7.3.2         NULL      NULL
Java 11.0.2       null      null
Perl 5.28.1       "ab"      "ab"
Ruby 2.6.1        nil       nil
Go 1.12           [1]       [1]
Rust 1.32.0       [1]       [1]
Node.js 10.15.1   undef     undef
regex 2019.3.12   None      None
re before patch   "ab"      "b"
re after patch    None      None

[1] doesn't support lookaround yet.
re.match(r'((?!(ab)c)(.))*', 'abab').groups()

PHP 7.3.2         "b",NULL,"b"
Java 11.0.2       "b",null,"b"
Perl 5.28.1       "b","ab","b"
Ruby 2.6.1        "b",nil,"b"
Go 1.12           [1]
Rust 1.32.0       [1]
Node.js 10.15.1   "b",undef,"b"
regex 2019.3.12   "b",None,"b"
re before patch   "b","b","b"
re after patch    "b",None,"b"

[1] doesn't support lookaround yet.
On 32 bit platform, 36 bytes -> 32 bytes.
On 64 bit platform, 72 bytes -> 64 bytes.

ctx->toplevel is used as a boolean type.
ctx->jump's possible value is below:

#define JUMP_NONE            0
#define JUMP_MAX_UNTIL_1     1
#define JUMP_MAX_UNTIL_2     2
#define JUMP_MAX_UNTIL_3     3
#define JUMP_MIN_UNTIL_1     4
#define JUMP_MIN_UNTIL_2     5
#define JUMP_MIN_UNTIL_3     6
#define JUMP_REPEAT          7
#define JUMP_REPEAT_ONE_1    8
#define JUMP_REPEAT_ONE_2    9
#define JUMP_MIN_REPEAT_ONE  10
#define JUMP_BRANCH          11
#define JUMP_ASSERT          12
#define JUMP_ASSERT_NOT      13
1,073,741,823 groups should enough for most users.

This change reduces sizeof(match_context) again:
- On 32 bit platform: 32 bytes, no change.
- On 64 bit platform: 64 bytes -> 56 bytes.

sre uses stack and match_context struct to simulate recursive call, smaller struct brings:
- deeper recursive call
- less memory consume
- less memory realloc

Here is a test, if limit the stack size (state->data_stack_base) to 1 GiB, the max available value of n is:

re.match(r'(ab)*', n * 'ab')   # need to save MARKs
72 bytes: n = 11,184,809
64 bytes: n = 12,201,610
56 bytes: n = 13,421,771

re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs
72 bytes: n = 13,421,770
64 bytes: n = 14,913,079
56 bytes: n = 16,777,214
These bugs about capturing group, sometimes lead to wrong span (begin > end):

Before the fix:
>>> re.match(r'(.b|a)*?b', 'ab').span(1)
(2, 1)
>>> re.match(r'(?:(?:a|bc)*?(xx)??z)*', 'axxzbcz').span(1)
(4, 3)
When (begin > end), PyUnicode_Substring() returns an empty string silently.
This change is a safety-check, maybe it can reveal undiscovered bugs.

Check at two moments to avoid redundant check:
- getting a capture group from state object directly
- creating a Match object
1. Remove error code SRE_ERROR_RECURSION_LIMIT.

This code has not been used since commit 2cbdc2a4 (2003-Dec-14).

2. Return SRE_ERROR_MEMORY immediately when allocate memory fail.

pattern_error() function will process SRE_ERROR_MEMORY properly. (tested)
@ghost
Copy link
Author

ghost commented Apr 5, 2019

1, Perl's mate confirmed there is a bug in Perl's regex engine.

https://rt.perl.org/Public/Bug/Display.html?id=133940

Perl's behavior in these test-cases may be wrong:
In Commit 5, Case-A/Case-B
In Commit 6, Case-A

2, I forgot to test .NET Framework.

Here is the test results on .NET Framework 4.8.
(its behavior is same as the patched re module)

Commit 4:
    Case-A:
        "xx"
    Case-B:
        "xyzxc", "x", "t"

Commit 5:
    Case-A:
        "", "a"
    Case-B:
        "", "a"
    Case-C:
        "a"
        
Commit 6:
    Case-A:
        "", "b"
    Case-B:
        "xx"

Commit 7:
    Case-A:
        ""
    Case-B:
        ""

Commit 8:
    Case-A:
        "b", "", "b"

3, Commit 11 raises a RuntimeError, maybe it's too hard?

A RuntimeWarning is more soft.

I am still confident that Commit 11's alarm will not be triggered.

@serhiy-storchaka serhiy-storchaka self-requested a review April 5, 2019 12:30
@ghost
Copy link
Author

ghost commented Apr 7, 2019

@serhiy-storchaka
Understanding the context is an annoying step for review.
So you can ask any questions, if it will save your time.

@ghost
Copy link
Author

ghost commented Apr 13, 2020

Above commit resovles code conflict to the master branch.

Copy link
Author

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed review comments.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an amazing work!

I have only some questions about "polishing" the code. Maybe left some refactoring to the next issue?

And would it be possible to split the test on several parts and give them meaningful names? We will add a code for atomic groups and possessive quantifiers. It is similar to the code for assertions and repetitions, so we may need to add similar changes in that code and add corresponding tests.

Comment on lines 1380 to 1381
/* should never get here */
Py_UNREACHABLE();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is reachable from JUMP_NONE.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm understanding it again.

Copy link
Author

@ghost ghost Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If jump == JUMP_NONE, SRE(match) function must be at the toplevel, then ctx_pos should be -1, it will return before the switch code.

exit:
    ctx_pos = ctx->last_ctx_pos;
    jump = ctx->jump;
    DATA_POP_DISCARD(ctx);
    if (ctx_pos == -1) {   // <- return before switch code
#ifdef Py_DEBUG
        if (ctx->toplevel && ret >= 0) {
            assert(state->repeat == NULL);
            assert(state->data_stack_base == 0);
        }
#endif
        return ret;
    }
    DATA_LOOKUP_AT(SRE(match_context), ctx, ctx_pos);

    switch (jump) {
        ....
        case JUMP_NONE:
            TRACE(("|%p|%p|RETURN %zd\n", ctx->pattern,
                   ctx->ptr, ret));
            break;
    }

    /* should never get here */
    Py_UNREACHABLE();
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"SRE(match) function must be at the toplevel", the "toplevel" here means ctx_pos == -1.
Sorry, It's been too long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this line is unreachable, the JUMP_NONE case is unreachable too. You should either put Py_UNREACHABLE() just after case JUMP_NONE: (maybe even replace case JUMP_NONE: with default:), or keep lines after case JUMP_NONE: and return ret just for the case of possible internal errors.

In any case this change is not related to bugs in the RE engine fixed in this PR. I suggest to restore the code and move this change to a separate PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case this change is not related to bugs in the RE engine fixed in this PR. I suggest to restore the code and move this change to a separate PR.

OK, it's a code polish, not bug fix, reverted.
I'm going to sleep, solve other problems tomorrow.

Modules/sre.h Outdated
@@ -16,12 +16,14 @@
/* size of a code word (must be unsigned short or larger, and
large enough to hold a UCS4 character) */
#define SRE_CODE Py_UCS4

/* SRE_MAXGROUPS is 1,073,741,823 */
#define SRE_MAXGROUPS (INT_MAX / 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it changed? On 32-bit platform match->mark requires 8 bytes per group, so SRE_MAXGROUPS was 2**28-1.

Copy link
Author

@ghost ghost Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Commit 10, it was changed to (INT_MAX / 2) (1,073,741,823), so the ->lastindex / ->lastmark attributes can be changed from Py_ssize_t to int32_t, which reduce the size of context struct.
edit: uint32_t -> int32_t

Modules/sre.h Outdated
Comment on lines 79 to 80
int32_t lastmark;
int32_t lastindex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I you want just to save some memory, it is better to left it for other issue. Some local variables can change its type too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be changed it another issue.

Comment on lines 513 to 514
int8_t jump;
int8_t toplevel;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may save some memory, but wouldn't reading/writing a single byte be slower than reading/writing a machine word on modern computers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my reason, is it right?

  • The "cache line" of some CPUs is 64 bytes, 64 bytes are read and written at a time as a whole. So if accessing a 4-byte integer that crosses a "cache line" boundary, need to access memory twice.
  • If the integer don't cross "cache line" boundary, there should be no difference between 1 byte and 4 bytes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted commit 9 and commit 10, they can be done in another issue.

@ghost
Copy link
Author

ghost commented Mar 21, 2022

And would it be possible to split the test on several parts and give them meaningful names? We will add a code for atomic groups and possessive quantifiers. It is similar to the code for assertions and repetitions, so we may need to add similar changes in that code and add corresponding tests.

After "atomic groups" and "possessive quantifiers" code is merged, I would like to review the code and to see if need to add some unit-tests.

@ghost
Copy link
Author

ghost commented Mar 22, 2022

I suggest only merging this PR to main branch, and note this in "Improved Modules" section of "What's New In Python 3.11", to prevent different results giving users surprise.

@ghost
Copy link
Author

ghost commented Mar 28, 2022

I revert all commits that unrelated to bug fixes.

I spent over a month in 2019 to looking for an example that would break JUMP_MIN_UNTIL_3, and couldn't find it.
I revisited it again in the past few days. If JUMP_MIN_UNTIL_1 doesn't need protection, JUMP_MIN_UNTIL_3 should also not need protection.

Follow-up, them can be done in other issues:

  • limit max group to 1,073,741,823, reduce sizeof(match_context).
  • issue23689, the SRE_REPEAT instances can be allocated statically.
  • code polish

@serhiy-storchaka
Copy link
Member

Great, great! Thank you again.

It seems to me, that changes in the MARK_PUSH() macro and only these changes are compatible with Perl. I think we can backport them, it is an obvious bugfix. Could you please extract these changes into a separate PR?

As for other changes, the old behavior matches the behavior of Perl and Java, so at least Python is not alone. The new behavior matches the behavior of the regex module, PHP, Ruby and Node.js (I did not check all cases) and is considered more correct. I think we should change it in 3.11. Please add an entry about this in the What's New document, section "Porting to Python 3.11".

Sorry for taking so long to review these changes.

@ghost
Copy link
Author

ghost commented Mar 29, 2022

It seems to me, that changes in the MARK_PUSH() macro and only these changes are compatible with Perl. I think we can backport them, it is an obvious bugfix. Could you please extract these changes into a separate PR?

I suggest only merge to 3.11 branch.

MARK_PUSH() macro bug has existed since Python 2.4 (year 2004):
https://github.com/python/cpython/blob/v2.4/Modules/_sre.c#L741
Another half a year is not too much. Python 3.11 final release on 2022-10-03.

Mainly because, if backport a part of these fixes, there will be three behaviors that may lead to (users & data) confusion.

Sorry for taking so long to review these changes.

This is not a matter, thanks for your review.

@ghost
Copy link
Author

ghost commented Mar 29, 2022

the old behavior matches the behavior of Perl and Java, so at least Python is not alone.

I didn't carefully looked into the bugs of Perl/Java, whether they behave exactly the same as Python. It can only be said that Perl and Java have bugs.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants