Skip to content

Conversation

stephentoub
Copy link
Member

  • Reduce RegexCompiler cost of using IndexOfAnyValues. With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This incurs both bounds checks and doesn't devirtualize. This PR changes the implementation to avoid the bounds check and to also enable devirtualization.

  • When we compute the starting set for a search, we see if we can easily extract from the set a few characters that completely compose it, but we were ignoring it if it was a negative set. Now that we have IndexOfAnyExcept, that's no longer necessary, and with the recent changes around IndexOfAnyValues, we would actually end up doing the slower thing of using IndexOfAnyValues even if it'll end up using one of the dedicated IndexOfAnyExcept overloads under the covers (at run time with the source generator it shouldn't be more expensive, but with RegexCompiler it is). This fixes the implementation to no longer ignore the negative case.

  • For loops and lazy loops, we use {Last}IndexOfAny{Except} methods to more quickly search through the loop, e.g. finding the next thing that can't be part of the loop, or when backtracking finding the next location to backtrack to. We now do so using IndexOfAnyValues when the other APIs aren't applicable. This means we no longer need a "TryEmitIndexOf"; we can always emit an IndexOf variant for any one/notone/set/multi.

  • When an expression begins with an appropriate loop, we insert an "update bumpalong" node after it to help avoid redoing work we've already done and that can't possibly be successful. However, if the loop fails midway due to not meeting its minimum, we never get to the code that would perform that bumpalong. This restructures things slightly so such an early exit can also benefit. In doing this, I removed the manual loop unrolling that was being done for small repeaters, as each iteration would have needed its own early exit; now that all versions of that can use an IndexOf variant, the manual loop unrolling is no longer as useful.

  • Improve regex source gen IndexOfAny naming for Unicode categories. When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.

  • Avoid using a IndexOf for the any set. We needn't search for anything, as everything matches.

  • Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths. And remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form. Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved. Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.

Fixes #84150
Fixes #84149
Fixes #84139

@ghost
Copy link

ghost commented Apr 5, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details
  • Reduce RegexCompiler cost of using IndexOfAnyValues. With the source generator, each IndexOfAnyValues is stored in its own static readonly field. This makes it cheap to access and allows the JIT to devirtualize calls to it. With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues. That means that every time we need one, we're loading the object out of the array. This incurs both bounds checks and doesn't devirtualize. This PR changes the implementation to avoid the bounds check and to also enable devirtualization.

  • When we compute the starting set for a search, we see if we can easily extract from the set a few characters that completely compose it, but we were ignoring it if it was a negative set. Now that we have IndexOfAnyExcept, that's no longer necessary, and with the recent changes around IndexOfAnyValues, we would actually end up doing the slower thing of using IndexOfAnyValues even if it'll end up using one of the dedicated IndexOfAnyExcept overloads under the covers (at run time with the source generator it shouldn't be more expensive, but with RegexCompiler it is). This fixes the implementation to no longer ignore the negative case.

  • For loops and lazy loops, we use {Last}IndexOfAny{Except} methods to more quickly search through the loop, e.g. finding the next thing that can't be part of the loop, or when backtracking finding the next location to backtrack to. We now do so using IndexOfAnyValues when the other APIs aren't applicable. This means we no longer need a "TryEmitIndexOf"; we can always emit an IndexOf variant for any one/notone/set/multi.

  • When an expression begins with an appropriate loop, we insert an "update bumpalong" node after it to help avoid redoing work we've already done and that can't possibly be successful. However, if the loop fails midway due to not meeting its minimum, we never get to the code that would perform that bumpalong. This restructures things slightly so such an early exit can also benefit. In doing this, I removed the manual loop unrolling that was being done for small repeaters, as each iteration would have needed its own early exit; now that all versions of that can use an IndexOf variant, the manual loop unrolling is no longer as useful.

  • Improve regex source gen IndexOfAny naming for Unicode categories. When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.

  • Avoid using a IndexOf for the any set. We needn't search for anything, as everything matches.

  • Remove categories from a set whose ranges make it already complete (when there's no subtraction). We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths. And remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories. If the char is present, the set can be morphed into Any. If the char isn't present, the categories can be removed and the set becomes a standard NotOne form. Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved. Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories. We were sometimes skipping the first category in this case. This is only relevant to the source generator, as these descriptions are output in comments.

Fixes #84150
Fixes #84149
Fixes #84139

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 8.0.0

@stephentoub
Copy link
Member Author

@joperezr, could you help review this when you get a chance? Thanks.

@joperezr
Copy link
Member

joperezr commented Apr 7, 2023

Sorry that I haven't gotten the chance yet. I'll take care of this first thing in the morning.

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Did you notice any significant gains after this in our benchmarks?

- Remove categories from a set whose ranges make it already complete (when there's no subtraction).  We have code paths that explicitly recognize the Any char class, and these extra categories knock these sets off those fast paths.
- Remove categories from a set where a single char is missing from the ranges, by checking whether that char is contained in the categories.  If the char is present, the set can be morphed into Any.  If the char isn't present, the categories can be removed and the set becomes a standard NotOne form.

Both of these are unlikely to be written explicitly by a developer but result from analysis producing search sets, in particular when alternations or nullable loops are involved.

Also fixed textual description of sets that both contain the last character (\uFFFF) and have categories.  We were sometimes skipping the first category in this case.  This is only relevant to the source generator, as these descriptions are output in comments.
We needn't search for anything, as everything matches.
@stephentoub
Copy link
Member Author

I'm removing the commit that makes more extensive use of IndexOfAnyValues, as it's not always having the desired impact. I'll do some more work on it and submit it separately.

When we're otherwise unable to come up with a good name for the custom IndexOfAny helper, if the set is just a handful of UnicodeCategory values, derive a name from those categories.
With the source generator, each IndexOfAnyValues is stored in its own static readonly field.  This makes it cheap to access and allows the JIT to devirtualize calls to it.

With RegexCompiler, we use a DynamicMethod and thus can't introduce new static fields, so instead we maintain an array of IndexOfAnyValues.  That means that every time we need one, we're loading the object out of the array.  This incurs both bounds checks and doesn't devirtualize.

This commit changes the implementation to avoid the bounds check and to also enable devirtualization.
@stephentoub stephentoub merged commit 00dbf84 into dotnet:main Apr 10, 2023
@stephentoub stephentoub deleted the regexindexofany2 branch April 10, 2023 16:31
@uweigand
Copy link
Contributor

uweigand commented May 9, 2023

Hi @stephentoub , we're now seeing failures in the System.Text.RegularExpressions.Tests.RegexMatchTests.Match_TestThatTimeoutHappens test case on s390x, see e.g. here:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-fb56ebe47ad240c79f/System.Text.RegularExpressions.Tests/1/console.86e3fbb5.log?helixlogtype=result

Unfortunately, these seem to be nondeterministic - in some CI runs the test passes, in others it fails. I'm not sure if this related to this PR or any of your other recent RegEx changes in the first place, but I cannot recall seeing this particular failure before about mid-April. When running the test locally on my system, so far I was completely unable to reproduce the failure, so I'm not sure how to start debugging this problem.

If we see the failure, it seems to be in either the Compiler or SourceGenerator flavor of the test case. Also, given the total run time shown in the CI logs, the problem doesn't appear to be that the regex test runs a long time and the timeout just doesn't trigger, but rather that the test completes quickly. (I don't know if the regex also matches correctly or not - the test doesn't seem to verify that.)

Do you have any ideas what could cause this failure? Or any suggestions on how to debug this? Thanks for your help!

@ghost ghost locked as resolved and limited conversation to collaborators Jun 8, 2023
@stephentoub
Copy link
Member Author

@uweigand, sorry for the delay in responding.

Are you still seeing this?

but rather that the test completes quickly

Right, this test is validating that when the processing of the regex takes long enough, a timeout exception gets thrown. I can't see the failure you cited anymore, but presumably the processing of the regex just happened so fast that it didn't time out and thus the test failed. I don't have an explanation for why that would be, though.

@DrewScoggins
Copy link
Member

We are still seeing a regression in this particular scenario on Windows x64.

image

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.