Improve regex compiler / source generator for sets #84370
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue Details
Fixes #84150
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Force-pushed from 9109bb7 to 5bea37f.
@joperezr, could you help review this when you get a chance? Thanks.
Sorry that I haven't gotten the chance yet. I'll take care of this first thing in the morning.
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorOutputTests.cs
LGTM. Did you notice any significant gains after this in our benchmarks?
- Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths.
- Remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form.

Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved.

Also fixed the textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.
We needn't search for anything, as everything matches.
I'm removing the commit that makes more extensive use of IndexOfAnyValues, as it's not always having the desired impact. I'll do some more work on it and submit it separately.
When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.
With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This both incurs bounds checks and prevents devirtualization. This commit changes the implementation to avoid the bounds check and to also enable devirtualization.
Force-pushed from 6ac789e to ceb7175.
Hi @stephentoub, we're now seeing failures in the […]. Unfortunately, these seem to be nondeterministic: in some CI runs the test passes, in others it fails. I'm not sure whether this is related to this PR or to any of your other recent Regex changes in the first place, but I cannot recall seeing this particular failure before about mid-April. When running the test locally on my system, I have so far been unable to reproduce the failure, so I'm not sure how to start debugging this problem. When we do see the failure, it seems to be in either the […]. Do you have any ideas what could cause this failure? Or any suggestions on how to debug this? Thanks for your help!
@uweigand, sorry for the delay in responding. Are you still seeing this?
Right, this test is validating that when the processing of the regex takes long enough, a timeout exception gets thrown. I can't see the failure you cited anymore, but presumably the processing of the regex just happened so fast that it didn't time out, and thus the test failed. I don't have an explanation for why that would be, though.
Reduce RegexCompiler cost of using IndexOfAnyValues. With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This both incurs bounds checks and prevents devirtualization. This PR changes the implementation to avoid the bounds check and to also enable devirtualization.
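For readers who want to see the two access patterns side by side, here is a minimal sketch. It assumes the type name `SearchValues<char>` from `System.Buffers`, which is the name IndexOfAnyValues shipped under in .NET 8. The class, field, and method names are hypothetical, and the sketch shows only the static-field and array-load shapes described above, not the change this PR actually makes to RegexCompiler.

```csharp
using System;
using System.Buffers;

static class SearchValuesAccess
{
    // Source-generator style: one static readonly field per set of values.
    private static readonly SearchValues<char> s_vowels = SearchValues.Create("aeiou");

    // RegexCompiler style (simplified): instances live in an array and are
    // loaded by index each time they are needed.
    private static readonly SearchValues<char>[] s_table = { SearchValues.Create("aeiou") };

    // Reads the instance straight from the static readonly field.
    public static int ViaStaticField(ReadOnlySpan<char> input) =>
        input.IndexOfAny(s_vowels);

    // Loads the instance out of the array first, which involves an index check.
    public static int ViaArray(ReadOnlySpan<char> input, int index) =>
        input.IndexOfAny(s_table[index]);
}
```

Both methods return the same index for the same input; the difference discussed in the description is only in how the instance is fetched before the call.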
When we compute the starting set for a search, we see if we can easily extract from the set a few characters that completely compose it, but we were ignoring it if it was a negative set. Now that we have IndexOfAnyExcept, that's no longer necessary, and with the recent changes around IndexOfAnyValues, we would actually end up doing the slower thing of using IndexOfAnyValues even if it'll end up using one of the dedicated IndexOfAnyExcept overloads under the covers (at run time with the source generator it shouldn't be more expensive, but with RegexCompiler it is). This fixes the implementation to no longer ignore the negative case.
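As an illustration of the negative case, the following sketch shows how a starting set such as `[^ab]` maps directly onto one of the dedicated `IndexOfAnyExcept` overloads on spans. The helper name is hypothetical and the code is hand-written for illustration; it is not the code the compiler or source generator emits.

```csharp
using System;

static class NegatedSetSearch
{
    // Hypothetical helper: returns the first index at which a match for the
    // negated set [^ab] could begin, or -1 if every character is 'a' or 'b'.
    public static int FindFirstNotAOrB(ReadOnlySpan<char> input) =>
        input.IndexOfAnyExcept('a', 'b');
}
```

For example, `NegatedSetSearch.FindFirstNotAOrB("aabxa")` returns 3, the position of `'x'`.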
For loops and lazy loops, we use {Last}IndexOfAny{Except} methods to more quickly search through the loop, e.g. finding the next thing that can't be part of the loop, or when backtracking finding the next location to backtrack to. We now do so using IndexOfAnyValues when the other APIs aren't applicable. This means we no longer need a "TryEmitIndexOf"; we can always emit an IndexOf variant for any one/notone/set/multi.
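A rough sketch of the idea, using hypothetical helper names: for a greedy loop such as `[a-z]*`, the end of the loop is the first character that can't be part of it, and when backtracking, the next candidate position is the last occurrence of whatever must follow the loop. Both can be found with a single vectorizable span call rather than a character-by-character walk. This is illustrative only and does not reproduce the emitted code.

```csharp
using System;

static class LoopScan
{
    // Number of characters a greedy [a-z]* consumes from the start of input:
    // the index of the first character outside 'a'..'z', or the whole input
    // if there is no such character.
    public static int CountLeadingLowercaseAscii(ReadOnlySpan<char> input)
    {
        int i = input.IndexOfAnyExceptInRange('a', 'z');
        return i < 0 ? input.Length : i;
    }

    // When backtracking through what a loop consumed, the next position worth
    // trying is the last occurrence of the character that must come next.
    public static int LastCandidate(ReadOnlySpan<char> consumed, char next) =>
        consumed.LastIndexOf(next);
}
```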
When an expression begins with an appropriate loop, we insert an "update bumpalong" node after it to help avoid redoing work we've already done and that can't possibly be successful. However, if the loop fails midway due to not meeting its minimum, we never get to the code that would perform that bumpalong. This restructures things slightly so such an early exit can also benefit. In doing this, I removed the manual loop unrolling that was being done for small repeaters, as each iteration would have needed its own early exit; now that all versions of that can use an IndexOf variant, the manual loop unrolling is no longer as useful.
Improve regex source gen IndexOfAny naming for Unicode categories. When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.
Avoid using an IndexOf for the any set. We needn't search for anything, as everything matches.
Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths. And remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form. Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved. Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.
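The single-missing-char rule can be sketched as follows. This is a simplified model under stated assumptions: the set's ranges are taken to cover every char except `missing`, the categories are plain (non-negated) `UnicodeCategory` values, and there is no subtraction. The type and method names are hypothetical and do not correspond to the actual RegexCharClass implementation.

```csharp
using System;
using System.Globalization;

static class SetSimplification
{
    public enum Kind { Any, NotOne }

    // Assumes the ranges cover everything except 'missing'.
    // If one of the categories contains 'missing', the set matches every char
    // and becomes Any; otherwise the categories add nothing beyond the ranges
    // and the set is equivalent to NotOne(missing).
    public static (Kind Kind, char Excluded) Simplify(char missing, params UnicodeCategory[] categories)
    {
        UnicodeCategory missingCategory = char.GetUnicodeCategory(missing);
        foreach (UnicodeCategory category in categories)
        {
            if (category == missingCategory)
            {
                return (Kind.Any, default);
            }
        }

        return (Kind.NotOne, missing);
    }
}
```

For instance, with `'\n'` as the missing char, passing `UnicodeCategory.Control` yields Any, while passing `UnicodeCategory.DecimalDigitNumber` yields NotOne with `'\n'` excluded.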
Fixes #84150
Fixes #84149
Fixes #84139