Improve regex compiler / source generator for sets #84370

stephentoub · 2023-04-05T18:51:51Z

Reduce RegexCompiler cost of using IndexOfAnyValues. With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This incurs both bounds checks and doesn't devirtualize. This PR changes the implementation to avoid the bounds check and to also enable devirtualization.
When we compute the starting set for a search, we see if we can easily extract from the set a few characters that completely compose it, but we were ignoring it if it was a negative set. Now that we have IndexOfAnyExcept, that's no longer necessary, and with the recent changes around IndexOfAnyValues, we would actually end up doing the slower thing of using IndexOfAnyValues even if it'll end up using one of the dedicated IndexOfAnyExcept overloads under the covers (at run time with the source generator it shouldn't be more expensive, but with RegexCompiler it is). This fixes the implementation to no longer ignore the negative case.
For loops and lazy loops, we use {Last}IndexOfAny{Except} methods to more quickly search through the loop, e.g. finding the next thing that can't be part of the loop, or when backtracking finding the next location to backtrack to. We now do so using IndexOfAnyValues when the other APIs aren't applicable. This means we no longer need a "TryEmitIndexOf"; we can always emit an IndexOf variant for any one/notone/set/multi.
When an expression begins with an appropriate loop, we insert an "update bumpalong" node after it to help avoid redoing work we've already done and that can't possibly be successful. However, if the loop fails midway due to not meeting its minimum, we never get to the code that would perform that bumpalong. This restructures things slightly so such an early exit can also benefit. In doing this, I removed the manual loop unrolling that was being done for small repeaters, as each iteration would have needed its own early exit; now that all versions of that can use an IndexOf variant, the manual loop unrolling is no longer as useful.
Improve regex source gen IndexOfAny naming for Unicode categories. When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.
Avoid using a IndexOf for the any set. We needn't search for anything, as everything matches.
Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths. And remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form. Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved. Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.

ghost · 2023-04-05T18:52:06Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Reduce RegexCompiler cost of using IndexOfAnyValues. With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This incurs both bounds checks and doesn't devirtualize. This PR changes the implementation to avoid the bounds check and to also enable devirtualization.
When we compute the starting set for a search, we see if we can easily extract from the set a few characters that completely compose it, but we were ignoring it if it was a negative set. Now that we have IndexOfAnyExcept, that's no longer necessary, and with the recent changes around IndexOfAnyValues, we would actually end up doing the slower thing of using IndexOfAnyValues even if it'll end up using one of the dedicated IndexOfAnyExcept overloads under the covers (at run time with the source generator it shouldn't be more expensive, but with RegexCompiler it is). This fixes the implementation to no longer ignore the negative case.
For loops and lazy loops, we use {Last}IndexOfAny{Except} methods to more quickly search through the loop, e.g. finding the next thing that can't be part of the loop, or when backtracking finding the next location to backtrack to. We now do so using IndexOfAnyValues when the other APIs aren't applicable. This means we no longer need a "TryEmitIndexOf"; we can always emit an IndexOf variant for any one/notone/set/multi.
When an expression begins with an appropriate loop, we insert an "update bumpalong" node after it to help avoid redoing work we've already done and that can't possibly be successful. However, if the loop fails midway due to not meeting its minimum, we never get to the code that would perform that bumpalong. This restructures things slightly so such an early exit can also benefit. In doing this, I removed the manual loop unrolling that was being done for small repeaters, as each iteration would have needed its own early exit; now that all versions of that can use an IndexOf variant, the manual loop unrolling is no longer as useful.
Improve regex source gen IndexOfAny naming for Unicode categories. When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.
Avoid using a IndexOf for the any set. We needn't search for anything, as everything matches.
Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths. And remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form. Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved. Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.

Fixes #84150
Fixes #84149
Fixes #84139

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	8.0.0

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

stephentoub · 2023-04-07T00:50:42Z

@joperezr, could you help review this when you get a chance? Thanks.

joperezr · 2023-04-07T01:11:05Z

Sorry that I haven't gotten the chance yet. I'll take care of this first thing in the morning.

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs

src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorOutputTests.cs

joperezr

LGTM. Did you notice any significant gains after this in our benchmarks?

- Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths. - Remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form. Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved. Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.

We needn't search for anything, as everything matches.

stephentoub · 2023-04-10T13:46:44Z

I'm removing the commit that makes more extensive use of IndexOfAnyValues, as it's not always having the desired impact. I'll do some more work on it and submit it separately.

When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.

With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This incurs both bounds checks and doesn't devirtualize. This commit changes the implementation to avoid the bounds check and to also enable devirtualization.

uweigand · 2023-05-09T12:56:57Z

Hi @stephentoub , we're now seeing failures in the System.Text.RegularExpressions.Tests.RegexMatchTests.Match_TestThatTimeoutHappens test case on s390x, see e.g. here:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-fb56ebe47ad240c79f/System.Text.RegularExpressions.Tests/1/console.86e3fbb5.log?helixlogtype=result

Unfortunately, these seem to be nondeterministic - in some CI runs the test passes, in others it fails. I'm not sure if this related to this PR or any of your other recent RegEx changes in the first place, but I cannot recall seeing this particular failure before about mid-April. When running the test locally on my system, so far I was completely unable to reproduce the failure, so I'm not sure how to start debugging this problem.

If we see the failure, it seems to be in either the Compiler or SourceGenerator flavor of the test case. Also, given the total run time shown in the CI logs, the problem doesn't appear to be that the regex test runs a long time and the timeout just doesn't trigger, but rather that the test completes quickly. (I don't know if the regex also matches correctly or not - the test doesn't seem to verify that.)

Do you have any ideas what could cause this failure? Or any suggestions on how to debug this? Thanks for your help!

stephentoub · 2023-07-05T04:00:12Z

@uweigand, sorry for the delay in responding.

Are you still seeing this?

but rather that the test completes quickly

Right, this test is validating that when the processing of the regex takes long enough, a timeout exception gets thrown. I can't see the failure you cited anymore, but presumably the processing of the regex just happened so fast that it didn't time out and thus the test failed. I don't have an explanation for why that would be, though.

DrewScoggins · 2023-08-25T22:15:53Z

We are still seeing a regression in this particular scenario on Windows x64.

stephentoub added the area-System.Text.RegularExpressions label Apr 5, 2023

stephentoub added this to the 8.0.0 milestone Apr 5, 2023

stephentoub requested review from joperezr and MihaZupan April 5, 2023 18:51

ghost assigned stephentoub Apr 5, 2023

stephentoub mentioned this pull request Apr 5, 2023

[Perf] Linux/arm64: 1 Regression on 3/28/2023 10:24:48 AM dotnet/perf-autofiling-issues#14907

Closed

MihaZupan reviewed Apr 5, 2023

View reviewed changes

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs Outdated Show resolved Hide resolved

EgorBo mentioned this pull request Apr 5, 2023

Unsafe.As doesn't update type information in JIT #84377

Closed

stephentoub force-pushed the regexindexofany2 branch from 9109bb7 to 5bea37f Compare April 5, 2023 20:53

build-analysis bot mentioned this pull request Apr 6, 2023

Checkout failure: "Git fetch failed with exit code 128" dotnet/arcade#9009

Open

2 tasks

build-analysis bot mentioned this pull request Apr 7, 2023

WindowsServiceLifetimeTests.ServiceCanStopItself fails in CI #84438

Closed

joperezr reviewed Apr 7, 2023

View reviewed changes

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs Outdated Show resolved Hide resolved

joperezr reviewed Apr 7, 2023

View reviewed changes

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs Outdated Show resolved Hide resolved

joperezr reviewed Apr 7, 2023

View reviewed changes

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs Outdated Show resolved Hide resolved

joperezr reviewed Apr 7, 2023

View reviewed changes

src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorOutputTests.cs Outdated Show resolved Hide resolved

joperezr approved these changes Apr 7, 2023

View reviewed changes

stephentoub added 3 commits April 10, 2023 09:40

Fix downlevel builds with a project reference to regex generator

c6515d0

Avoid using a IndexOf for the any set

88d3f41

We needn't search for anything, as everything matches.

stephentoub added 2 commits April 10, 2023 09:48

Improve regex source gen IndexOfAny naming for Unicode categories

7df8852

When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.

stephentoub force-pushed the regexindexofany2 branch from 6ac789e to ceb7175 Compare April 10, 2023 14:17

stephentoub merged commit 00dbf84 into dotnet:main Apr 10, 2023

stephentoub deleted the regexindexofany2 branch April 10, 2023 16:31

MihaZupan mentioned this pull request May 10, 2023

Reduce overhead of calling into SearchValues #86046

Merged

ghost locked as resolved and limited conversation to collaborators Jun 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve regex compiler / source generator for sets #84370

Improve regex compiler / source generator for sets #84370

Uh oh!

stephentoub commented Apr 5, 2023

Uh oh!

ghost commented Apr 5, 2023

Uh oh!

Uh oh!

stephentoub commented Apr 7, 2023

Uh oh!

joperezr commented Apr 7, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joperezr left a comment

Uh oh!

stephentoub commented Apr 10, 2023

Uh oh!

uweigand commented May 9, 2023

Uh oh!

stephentoub commented Jul 5, 2023

Uh oh!

DrewScoggins commented Aug 25, 2023

Uh oh!

Uh oh!

Improve regex compiler / source generator for sets #84370

Improve regex compiler / source generator for sets #84370

Uh oh!

Conversation

stephentoub commented Apr 5, 2023

Uh oh!

ghost commented Apr 5, 2023

Uh oh!

Uh oh!

stephentoub commented Apr 7, 2023

Uh oh!

joperezr commented Apr 7, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joperezr left a comment

Choose a reason for hiding this comment

Uh oh!

stephentoub commented Apr 10, 2023

Uh oh!

uweigand commented May 9, 2023

Uh oh!

stephentoub commented Jul 5, 2023

Uh oh!

DrewScoggins commented Aug 25, 2023

Uh oh!

Uh oh!