-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
bpo-35859: fix bugs in re engine #12427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If no error occurs, when the top level call returns: - the stack should be empty - state->repeat should be NULL This prevents the following commits to wreck the stack.
MARK_PUSH() didn't protect MARK0 if it was the only available mark. This bug was reported in issue35859.
JUMP_MAX_UNTIL_3 and JUMP_MIN_UNTIL_2 are tails of a repeat. Before the JUMPs, sre sets state->repeat to the tail's repeat object: state->repeat = ctx->u.rep->prev; DO_JUMP(JUMP_MAX_UNTIL_3, jump_max_until_3, ctx->pattern); But after the JUMPs, it doesn't restore state->repeat if the JUMP is SUCCESS. This won't cause a bug, because SRE_OP_REPEAT will restore state->repeat later. But it messes up state->repeat in some levels of JUMPs. This behavior exists since commit 29c4ba9 (2000-Aug-2), a very immature period of sre. Now fix like this: state->repeat = ctx->u.rep->prev; DO_JUMP(JUMP_MAX_UNTIL_3, jump_max_until_3, ctx->pattern); + state->repeat = ctx->u.rep; RETURN_ON_SUCCESS(ret); - state->repeat = ctx->u.rep; Then backtrack points don't need to save state->repeat to ctx->u.rep. Otherwise, if backtrack point is inside a repeat body, after a JUMP, state->repeat may points to the tail's repeat object (NULL or different repeat object).
These bugs are due to the lack of protection before JUMPs. This PR will add some protections shown in below table. Legend of the table: * No: no protection * L : LASTMARK_SAVE() * MU: MARK_PUSH() unconditionally * MR: MARK_PUSH() if in a repeat unpatched patched JUMP_MAX_UNTIL_1 No No # repeat body, to min limit JUMP_MAX_UNTIL_2 L/MU L/MU # repeat body, until max limit JUMP_MAX_UNTIL_3 No No # backtrack point JUMP_MIN_UNTIL_1 No No # repeat body, to min limit JUMP_MIN_UNTIL_2 L -> L/MR # backtrack point JUMP_MIN_UNTIL_3 No No # repeat body, until max limit JUMP_REPEAT_ONE_1 L -> L/MR # backtrack point JUMP_REPEAT_ONE_2 L -> L/MR # backtrack point JUMP_MIN_REPEAT_ONE L -> L/MR # backtrack point JUMP_BRANCH L/MR L/MR # backtrack point JUMP_ASSERT No No # sub-pattern JUMP_ASSERT_NOT No -> L/MR # sub-pattern 🔼 It's clear that every backtrack points should L/MR. Commit 4/5/6 will add L/MR to backtrack points respectively. JUMP_MAX_UNTIL_3 is a backtrack point as well, but this one doesn't need L/MR protection. See this picture's explanation: https://github.com/raw/animalize/pics/master/issues/max_until.jpg 🔼 JUMP_ASSERT_NOT's situation will be explained in Commit 7. 🔼 It seems other JUMPs don't need protect. If these JUMPs fail, either an entire match failure or can be restored by backtrack points. - JUMP_MAX_UNTIL_1 - JUMP_MIN_UNTIL_1 - JUMP_MIN_UNTIL_3 - JUMP_ASSERT I really can't find a test-case to break them. If I missed something, hope Commit 11 will catch them. 🔼 Test test-cases in this change with major regex engines: Case-A: s = 'axxzbcz' p = r'(?:(?:a|bc)*?(xx)??z)*' re.match(p, s).groups() Case-B: s = 'xtcxyzxc' p = r'((x|yz)+?(t)??c)*' re.match(p, s).groups() Case-A Case-B PHP 7.3.2 "xx" "xyzxc","x","t" Java 11.0.2 "xx" "xyzxc","x","t" Perl 5.28.1 undef [1] "xyzxc","x",undef [1] Ruby 2.6.1 "xx" "xyzxc","x","t" Go 1.12 "xx" "xyzxc","x","t" Rust 1.32.0 "xx" "xyzxc","x","t" Node.js 10.15.1 undef [1] "xyzxc","x",undef [1] regex 2019.3.12 "xx" "xyzxc","x","t" re before patch "" [2] "xyzxc","x","" [2] re after patch "xx" "xyzxc","x","t" [1] different policy, not bug. [2] "" is empty str, not None.
Case-A: s = 'aabaab' p = r'(?:[^b]*a(?=(b)|(a))ab)*' re.match(p, s).groups() Case-B: s = 'abab' p = r'(?:[^b]*(?=(b)|(a))ab)*' re.match(p, s).groups() Case-C: s = 'ab' p = r'(ab?)*?b' re.match(p, s).groups() Case-A Case-B Case-C PHP 7.3.2 NULL,"a" NULL,"a" "a" Java 11.0.2 "b","a" [1] "b","a" [1] "a" Perl 5.28.1 "b","a" "b","a" "a" Ruby 2.6.1 nil,"a" nil,"a" "a" Go 1.12 [2] [2] "a" Rust 1.32.0 [2] [2] "a" Node.js 10.15.1 undef,"a" undef,"a" "a" regex 2019.3.12 None,"a" None,"a" "a" re before patch "b","a" "b","a" "" [3] re after patch None,"a" None,"a" "a" [1] seems this bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7145888 [2] doesn't support lookaround yet. [3] "" is empty str, not None.
Case-A: s = 'abab' p = r'(?:.*?(?=(a)|(b))b)*' re.match(p, s).groups() Case-B: s = 'axxzaz' p = r'(?:a*?(xx)??z)*' re.match(p, s).groups() Case-A Case-B PHP 7.3.2 NULL,"b" "xx" Java 11.0.2 "a","b" [1] "xx" Perl 5.28.1 "a","b" undef [2] Ruby 2.6.1 nil,"b" "xx" Go 1.12 [3] "xx" Rust 1.32.0 [3] "xx" Node.js 10.15.1 undef,"b" undef regex 2019.3.12 None,"b" "xx" re before patch "a","b" "" [4] re after patch None,"b" "xx" [1] seems this bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7145888 [2] different policy, not bug. [3] doesn't support lookaround yet. [4] "" is empty str, not None.
JUMP_ASSERT_NOT is special, when the sub-pattern fails, the match succeeds. Here is a detailed explanation came from Greg Chapman: > Greg Chapman wrote in issue725149's patch: > > For negative assertions, if the subpattern fails, the assertion > succeeds and we want to continue matching. However, in failing, > the subpattern may have left any capturing groups it contains > in an inconsistent (partially matched) state. For consistency, > "unmark" all groups in the subpattern after it completes by > restoring marks to the state they had before executing the > subpattern. When not in a repeat, it is sufficient simply to > restore lastmark to its previous value (since a repeat is the > only way to move from right to left in a pattern, without a repeat > marks in the assertion will be greater than any so far > encountered. This bug was initially reported in issue725149. Case-A: re.match(r'(?!(..)c)', 'ab').groups() Case-B: re.match(r'(?:(?!(ab)c).)*', 'ab').groups() Case-A Case-B PHP 7.3.2 NULL NULL Java 11.0.2 null null Perl 5.28.1 "ab" "ab" Ruby 2.6.1 nil nil Go 1.12 [1] [1] Rust 1.32.0 [1] [1] Node.js 10.15.1 undef undef regex 2019.3.12 None None re before patch "ab" "b" re after patch None None [1] doesn't support lookaround yet.
re.match(r'((?!(ab)c)(.))*', 'abab').groups() PHP 7.3.2 "b",NULL,"b" Java 11.0.2 "b",null,"b" Perl 5.28.1 "b","ab","b" Ruby 2.6.1 "b",nil,"b" Go 1.12 [1] Rust 1.32.0 [1] Node.js 10.15.1 "b",undef,"b" regex 2019.3.12 "b",None,"b" re before patch "b","b","b" re after patch "b",None,"b" [1] doesn't support lookaround yet.
On 32 bit platform, 36 bytes -> 32 bytes. On 64 bit platform, 72 bytes -> 64 bytes. ctx->toplevel is used as a boolean type. ctx->jump's possible value is below: #define JUMP_NONE 0 #define JUMP_MAX_UNTIL_1 1 #define JUMP_MAX_UNTIL_2 2 #define JUMP_MAX_UNTIL_3 3 #define JUMP_MIN_UNTIL_1 4 #define JUMP_MIN_UNTIL_2 5 #define JUMP_MIN_UNTIL_3 6 #define JUMP_REPEAT 7 #define JUMP_REPEAT_ONE_1 8 #define JUMP_REPEAT_ONE_2 9 #define JUMP_MIN_REPEAT_ONE 10 #define JUMP_BRANCH 11 #define JUMP_ASSERT 12 #define JUMP_ASSERT_NOT 13
1,073,741,823 groups should enough for most users. This change reduces sizeof(match_context) again: - On 32 bit platform: 32 bytes, no change. - On 64 bit platform: 64 bytes -> 56 bytes. sre uses stack and match_context struct to simulate recursive call, smaller struct brings: - deeper recursive call - less memory consume - less memory realloc Here is a test, if limit the stack size (state->data_stack_base) to 1 GiB, the max available value of n is: re.match(r'(ab)*', n * 'ab') # need to save MARKs 72 bytes: n = 11,184,809 64 bytes: n = 12,201,610 56 bytes: n = 13,421,771 re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs 72 bytes: n = 13,421,770 64 bytes: n = 14,913,079 56 bytes: n = 16,777,214
These bugs about capturing group, sometimes lead to wrong span (begin > end): Before the fix: >>> re.match(r'(.b|a)*?b', 'ab').span(1) (2, 1) >>> re.match(r'(?:(?:a|bc)*?(xx)??z)*', 'axxzbcz').span(1) (4, 3) When (begin > end), PyUnicode_Substring() returns an empty string silently. This change is a safety-check, maybe it can reveal undiscovered bugs. Check at two moments to avoid redundant check: - getting a capture group from state object directly - creating a Match object
1. Remove error code SRE_ERROR_RECURSION_LIMIT. This code has not been used since commit 2cbdc2a4 (2003-Dec-14). 2. Return SRE_ERROR_MEMORY immediately when allocate memory fail. pattern_error() function will process SRE_ERROR_MEMORY properly. (tested)
1, Perl's mate confirmed there is a bug in Perl's regex engine.https://rt.perl.org/Public/Bug/Display.html?id=133940 Perl's behavior in these test-cases may be wrong: 2, I forgot to test .NET Framework.Here is the test results on
3, Commit 11 raises a RuntimeError, maybe it's too hard?A RuntimeWarning is more soft. I am still confident that Commit 11's alarm will not be triggered. |
@serhiy-storchaka |
Above commit resovles code conflict to the master branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed review comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an amazing work!
I have only some questions about "polishing" the code. Maybe left some refactoring to the next issue?
And would it be possible to split the test on several parts and give them meaningful names? We will add a code for atomic groups and possessive quantifiers. It is similar to the code for assertions and repetitions, so we may need to add similar changes in that code and add corresponding tests.
Modules/sre_lib.h
Outdated
/* should never get here */ | ||
Py_UNREACHABLE(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is reachable from JUMP_NONE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm understanding it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If jump == JUMP_NONE
, SRE(match)
function must be at the toplevel, then ctx_pos
should be -1
, it will return before the switch
code.
exit:
ctx_pos = ctx->last_ctx_pos;
jump = ctx->jump;
DATA_POP_DISCARD(ctx);
if (ctx_pos == -1) { // <- return before switch code
#ifdef Py_DEBUG
if (ctx->toplevel && ret >= 0) {
assert(state->repeat == NULL);
assert(state->data_stack_base == 0);
}
#endif
return ret;
}
DATA_LOOKUP_AT(SRE(match_context), ctx, ctx_pos);
switch (jump) {
....
case JUMP_NONE:
TRACE(("|%p|%p|RETURN %zd\n", ctx->pattern,
ctx->ptr, ret));
break;
}
/* should never get here */
Py_UNREACHABLE();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"SRE(match) function must be at the toplevel", the "toplevel" here means ctx_pos == -1
.
Sorry, It's been too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this line is unreachable, the JUMP_NONE case is unreachable too. You should either put Py_UNREACHABLE()
just after case JUMP_NONE:
(maybe even replace case JUMP_NONE:
with default:
), or keep lines after case JUMP_NONE:
and return ret
just for the case of possible internal errors.
In any case this change is not related to bugs in the RE engine fixed in this PR. I suggest to restore the code and move this change to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case this change is not related to bugs in the RE engine fixed in this PR. I suggest to restore the code and move this change to a separate PR.
OK, it's a code polish, not bug fix, reverted.
I'm going to sleep, solve other problems tomorrow.
Modules/sre.h
Outdated
@@ -16,12 +16,14 @@ | |||
/* size of a code word (must be unsigned short or larger, and | |||
large enough to hold a UCS4 character) */ | |||
#define SRE_CODE Py_UCS4 | |||
|
|||
/* SRE_MAXGROUPS is 1,073,741,823 */ | |||
#define SRE_MAXGROUPS (INT_MAX / 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it changed? On 32-bit platform match->mark
requires 8 bytes per group, so SRE_MAXGROUPS was 2**28-1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Commit 10
, it was changed to (INT_MAX / 2)
(1,073,741,823), so the ->lastindex
/ ->lastmark
attributes can be changed from Py_ssize_t
to int32_t
, which reduce the size of context struct.
edit: uint32_t
-> int32_t
Modules/sre.h
Outdated
int32_t lastmark; | ||
int32_t lastindex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I you want just to save some memory, it is better to left it for other issue. Some local variables can change its type too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be changed it another issue.
Modules/sre_lib.h
Outdated
int8_t jump; | ||
int8_t toplevel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may save some memory, but wouldn't reading/writing a single byte be slower than reading/writing a machine word on modern computers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my reason, is it right?
- The "cache line" of some CPUs is 64 bytes, 64 bytes are read and written at a time as a whole. So if accessing a 4-byte integer that crosses a "cache line" boundary, need to access memory twice.
- If the integer don't cross "cache line" boundary, there should be no difference between 1 byte and 4 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted commit 9
and commit 10
, they can be done in another issue.
Misc/NEWS.d/next/Library/2019-03-14-09-08-25.bpo-35859.8lFdLe.rst
Outdated
Show resolved
Hide resolved
After "atomic groups" and "possessive quantifiers" code is merged, I would like to review the code and to see if need to add some unit-tests. |
This reverts commit f738245.
I suggest only merging this PR to |
I revert all commits that unrelated to bug fixes. I spent over a month in 2019 to looking for an example that would break Follow-up, them can be done in other issues:
|
Great, great! Thank you again. It seems to me, that changes in the MARK_PUSH() macro and only these changes are compatible with Perl. I think we can backport them, it is an obvious bugfix. Could you please extract these changes into a separate PR? As for other changes, the old behavior matches the behavior of Perl and Java, so at least Python is not alone. The new behavior matches the behavior of the regex module, PHP, Ruby and Node.js (I did not check all cases) and is considered more correct. I think we should change it in 3.11. Please add an entry about this in the What's New document, section "Porting to Python 3.11". Sorry for taking so long to review these changes. |
I suggest only merge to 3.11 branch.
Mainly because, if backport a part of these fixes, there will be three behaviors that may lead to (users & data) confusion.
This is not a matter, thanks for your review. |
I didn't carefully looked into the bugs of Perl/Java, whether they behave exactly the same as Python. It can only be said that Perl and Java have bugs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
For detailed explanation, please view each commit message in this PR.
Suggested overall commit message:
https://bugs.python.org/issue35859