Regex optimization opportunities #1349

Closed

@stephentoub

We’ve made some very measurable improvements to Regex post-.NET Core 3.1 (e.g. #271, #449, #542, #1348), but there’s still a lot of opportunity for incremental improvement (we can also consider a more foundational re-implementation if desired). Having spent some time over the holidays looking at the implementation, here’s a brain dump of some ideas that can be explored:

Vectorization / faster searching:

  • Vectorize and/or remove Boyer-Moore. We should explore whether a simple String/Span.IndexOf(char), which is now vectorized, will end up yielding greater benefits than the Boyer-Moore implementation currently employed (a sketch follows this list). Alternatively, we should at least look at augmenting the implementation with a String/Span.IndexOf(char) to take advantage of its vectorization when searching for the next starting point. We can then also potentially add support for searching for multiple substrings in a vectorized manner. Rust’s regex engine does this with the Teddy algorithm.
  • Improve Boyer-Moore's prefix. If we keep Boyer-Moore, it can be significantly improved. In particular, to find the string that it looks for, the RegexFCD.Prefix method is called on the RegexNode tree to find the prefix string to search for, and as the comment on the method says “It's quite trivial and gives up easily.” For example, it’s common to see a regex like \bsomeword\b, but the moment RegexFCD.Prefix sees that \b boundary, it gives up.
  • Use Boyer-Moore for internal literals. We currently only look for strings at the start of the regex with which to use Boyer-Moore. But if the regex doesn't begin with a literal yet contains one at a fixed offset from its start, e.g. [abc]def, we could still use Boyer-Moore with "def" and then back up by the offset for the actual matching engine.
  • Use Aho–Corasick for FindFirstChar. We currently use Boyer-Moore when there's a single prefix to be searched for, e.g. with a regex like "abcd(efg|hij)". But if we instead have a regex like "(efg|hij)abcd", we fall back to doing a search for "[eh]". With Aho-Corasick, we could instead do a search for both "efg" and "hij" concurrently, or even "efgabcd" and "hijabcd". It can also be vectorized to some extent.
  • Improve what we can vectorize in FindFirstChar. Given a regex like (hi|hello) there, we’ll do a vectorized search for h, but given [\w]+, we’ll iterate through the characters one-by-one doing a bit vector lookup on each character to see whether it’s a word character. We might be able to take advantage of some of the newer SSE/AVX intrinsics to do this faster.
  • Add notoneloop{Atomic} search to backtracking compiler. In the non-backtracking compiler, we use span.IndexOf to search for the character. We can do the same with string.IndexOf in the full-featured compiler.
  • Add setloop IndexOfAny search to non-backtracking compiler. When we encounter a notoneloopatomic in the non-backtracking compiler, we emit an IndexOf to search for the target character. We could do the same with IndexOfAny(char, char) and IndexOfAny(char, char, char) when we have a setloopatomic around a set with only two or three negated chars.
  • Unroll more or less. We currently unroll the comparison for a string, e.g. the text hello is matched with a series of generated comparisons, one for each character. In contrast, for repeaters, we code gen an actual loop. We should explore in what situations it might be beneficial to unroll (partially) a repeater as well, or conversely, whether there are cases where we’re better off generating a loop for strings rather than unrolling.
  • Use string/span.IndexOf{Any} methods in interpreter. As long as we can quickly determine that they should be used, it should in general be a win to use these methods from the interpreter in some of the same places we use them in the compiled implementation(s).
  • Compare more than one char at a time. In some cases, such as when validating a string, we compare one character at a time (even if it's via an unrolled loop). We should be able to read a larger unit and compare it against a larger unit, e.g. rather than reading a char and comparing it against 'n' and then another and comparing it against 'o', we could read an Int32 and compare it against something like 'n' << 16 | 'o' (also sketched after this list). We can do the same with Int64 or even SIMD vectors.
  • IndexNotOf. Consider adding a vectorized IndexNotOf method (https://github.com/dotnet/corefx/issues/35594#issuecomment-572834317), and then use that to implement oneloop{atomic} as well as setloop{atomic} for a set containing just a few known characters. If we wanted to do this but weren't yet convinced about public API surface area, IndexNotOf could be added as an internal method in System.Text.RegularExpressions.dll, and then used from the in-memory DynamicMethod compiled regexes (but not from those produced by CompileToAssembly).
  • Improve vectorization of existing helpers. We use existing .NET searching methods now whenever possible to implement the various searches we perform, e.g. IndexOf, IndexOfAny, etc. Most of these have some level of vectorization now, but considering how much time we can end up spending in these routines while executing regexes, it's worth spending more time optimizing them, e.g. Improve SpanHelpers.IndexOfAny throughput / vectorization #25023.
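
To illustrate the Boyer-Moore bullet above, here's a minimal sketch, assuming a pattern whose first character is a known literal, of letting the vectorized Span.IndexOf(char) drive the search for the next viable starting position. TryFindNextStart is a hypothetical helper, not the actual FindFirstChar code:

```csharp
using System;

static class Scanner
{
    // Hypothetical helper: advance to the next position where the pattern's
    // known first character occurs, using the vectorized IndexOf instead of
    // a char-by-char scan.
    public static bool TryFindNextStart(ReadOnlySpan<char> input, int start, char firstChar, out int pos)
    {
        int relative = input.Slice(start).IndexOf(firstChar); // vectorized in .NET Core
        pos = relative >= 0 ? start + relative : -1;
        return relative >= 0;
    }
}
```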
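
And for comparing more than one char at a time, a rough sketch of folding two char comparisons into one 32-bit comparison. StartsWithNo and the exact constant are illustrative, and note that the constant's layout depends on endianness (little-endian assumed here):

```csharp
using System;
using System.Runtime.InteropServices;

static class TwoCharCompare
{
    // Compare input[i..i+2] against "no" with one 32-bit load instead of two
    // char comparisons. On little-endian hardware the first char occupies the
    // low 16 bits, so the constant is 'o' << 16 | 'n'.
    public static bool StartsWithNo(ReadOnlySpan<char> input, int i)
    {
        if ((uint)(i + 1) >= (uint)input.Length) return false;
        uint pair = MemoryMarshal.Read<uint>(MemoryMarshal.AsBytes(input.Slice(i, 2)));
        return pair == ('o' << 16 | 'n');
    }
}
```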

Tree rewriting:

  • Factor out substrings from alternation. Consider a regex like (?:this|that) is great. This could be automatically rewritten as th(?:is|at) is great, which has several potential benefits: a) it exposes a prefix that the Boyer-Moore matching will see and be able to use in FindFirstChar, b) it reduces the amount of backtracking that’ll be necessary (backtracking to after rather than to before the "th"), and c) the string might be able to partake in other optimizations, such as being used to convert earlier loops into atomic ones. (A sketch of the factoring follows this list.)
  • Improve auto-atomicification. I added basic support to find backtracking constructs where backtracking would never help and to automatically make them atomic. That has three benefits: it makes failures to match much faster, it results in smaller code gen in general, and it enables our non-backtracking implementation to have a greater chance of kicking in. However, while the implementation covers a decent number of patterns, it leaves more on the table. For example, consider the regex '[^']*' which finds a pair of single quotes and everything between them. We’ll detect that the [^']* can’t possibly benefit from backtracking, because it’ll never match the character that’s required after this loop. However, if we had that exact same loop as part of an alternation, e.g. ('[^']*|something)', we won’t detect that. Currently the implementation only does two checks: it looks at successor nodes in a concatenation, and it looks at the last node in the regex that has no successors. Obviously there are plenty of other cases where analysis can quickly prove upgrading to an atomic grouping is correct. (Note that other implementations call this auto-possessification, named after a possessive quantifier. .NET doesn’t have a possessive quantifier, but a possessive quantifier is just shorthand for an atomic group, which .NET does have.)
  • Avoid unnecessary alternation backtracking. Consider a regex like (Hello|Hi), .*\. How are you\? If we feed this a string like Hello, steve. How are ya?, it’ll match the "Hello, " but fail to match the rest… at that point, it’ll start over and try with Hi, but there’s no point in doing so: if Hello matched, Hi can’t possibly match at the same position. We can detect this statically from the regex pattern; if every branch N has a required prefix that can’t match wherever an earlier branch’s prefix matched (e.g. neither is a prefix of the other), we can make the alternation atomic, e.g. wrap it in (?> … ).
  • Eliminate unnecessary alternation branches. We can statically determine if it's impossible for an alternation branch N + 1 to match after we've gotten to a certain place in matching branch N, e.g. given the alternation "abc|def" we know that once we match the 'a', there's no point in even considering the second branch. We should be able to utilize that knowledge to avoid unnecessarily checking branches when a previous one fails to match.
  • Make alternations atomic when possible. If we can see that once any branch of the alternation matches it's impossible for another branch to match (e.g. if they all begin with different letters), then we can make the alternation atomic, as no amount of backtracking into the alternation after it's matched will help.
  • Optimize regexes starting with .*. If the RegexNode.FinalOptimize step sees that the Regex begins with a star loop (or .+), as long as it’s not part of an explicit capture or alternation, it can prepend a ^ anchor. That can avoid a ton of work as FindFirstChar will then return false on a subsequent call rather than bumping by 1 and going again. It might be easier to do this in the parser.
  • Merge more nodes in concatenation reduction. We already do a pass over concatenations to reduce the nodes and combine them where we can, e.g. combining five consecutive "one" nodes into a "multi". But we can do more, e.g. combining an expression like "[abcd]+[abcd]" into "[abcd]{2,}". This has multiple benefits. For one, it allows us more freedom in how we generate optimized code / handling of the tree. Second, it exposes additional optimization opportunity, e.g. if the expression begins with ..* and we combine that to be .+, we can then more easily see that the expression begins with a star loop that lets us automatically anchor it.
  • Better node searching for atomic overlap comparison check. We now attempt to convert non-atomic loops into atomic ones whenever possible. We do this by finding a loop and then looking at what comes after it and seeing if there's any overlap, e.g. with an expression [ab]*[cd], we'll see that there's no overlap between what can be matched by the loop and what comes next, and we'll make the loop atomic. We would similarly do that if the expression were [ab]*[cd]+, because we see that what comes after the first loop must occur at least once. However, if the expression were [ab]*[cd]*, we won't make the first loop atomic, because we see that the second loop has a min repeat count of 0 and thus isn't guaranteed to appear. We could be more aggressive here, and look at all possible subsequent nodes. For example, here we would see that there's nothing after the second loop in the expression, and so if the second pattern does appear, we won't overlap with it, and if the second pattern doesn't appear, we're at the end, so either way we can make it atomic. Similarly, if the expression were [ab]*[cd]*[ef], we would make both loops atomic; for the first, we'd see that it could be followed by either [cd] or [ef], and it doesn't overlap with either.
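
To make the substring-factoring idea concrete, here's a toy sketch that operates on plain literal branch strings rather than the real RegexNode tree; FactorPrefix is hypothetical:

```csharp
using System.Linq;

static class AlternationFactoring
{
    // Toy version of factoring "this|that" into "th(?:is|at)": find the
    // longest common prefix of the (purely literal) branches and hoist it out.
    public static string FactorPrefix(string[] branches)
    {
        if (branches.Length == 0) return "";

        int prefixLen = 0;
        while (branches.All(b => b.Length > prefixLen && b[prefixLen] == branches[0][prefixLen]))
            prefixLen++;

        if (prefixLen == 0) return "(?:" + string.Join("|", branches) + ")";

        string prefix = branches[0].Substring(0, prefixLen);
        var suffixes = branches.Select(b => b.Substring(prefixLen));
        return prefix + "(?:" + string.Join("|", suffixes) + ")";
    }
}
// FactorPrefix(new[] { "this", "that" }) returns "th(?:is|at)"
```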

Better handling of specific subexpressions:

  • Optimize .*somestring. It’s pretty common to see a star loop followed by a string. Today’s implementation is very naïve: it dutifully searches for the \n, tries to compare the string there, backs off by 1, tries to compare the string there, and so on, one position at a time. We can instead use an algorithm like Aho-Corasick to search back through the star loop, or even just a vectorized LastIndexOf for the string (a sketch follows this list).
  • Add Capture support to the non-backtracking code gen implementation. This is one of the main things that keeps a non-trivial number of regexes from switching to using this faster implementation, primarily from what I’ve seen because developers don’t think about the fact that () is actually a capture group, and so something like foo(bar|zoo) which uses the parens to separate the alternation is actually also adding a capture (instead of the dev having written foo(?:bar|zoo)). Adding capture support to the non-backtracking codegen shouldn’t be hard, and it’ll allow more cases to switch over to that implementation.
  • Enlighten FindFirstChar about Bol anchor. The beginning-of-line anchor currently isn't used by FindFirstChar, but we should be able to use it to do a fast IndexOf search for a newline in order to position the search at the next viable location. This would make the ^ anchor much faster when RegexOptions.Multiline is used (though it's not clear how common that is).
  • Reordering checks in codegen. We currently always evaluate the match in order, e.g. for a pattern "a\sb", we'll check the 'a', then the whitespace, and then the 'b'. But that's not actually required. For fixed-size regions like this, we can choose to reorder the checks and do the cheapest ones first, and/or first do the ones most likely to fail. For example, in this expression, we might want to do the 'b' check before the whitespace check, because the single comparison for the 'b' is going to be faster than the char.IsWhiteSpace call for the \s. As another example, if the expression were "a\sz", all else being equal, we're probably less likely to see 'z' in text being matched than we are 'a', so if we had the choice to do the 'z' check first, it might be wise to do so.
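
For the .*somestring case, a minimal sketch, with hypothetical names, of replacing the back-off-by-1 probing with a single vectorized LastIndexOf over the region the star loop can consume:

```csharp
using System;

static class StarLoopLiteral
{
    // For a pattern like ".*somestring": '.' doesn't match '\n' by default,
    // so the greedy loop can consume at most to end of line. Instead of
    // backing off one char at a time and re-testing, jump straight to the
    // last occurrence of the literal within that region.
    public static int FindLiteralInStarLoop(ReadOnlySpan<char> input, int loopStart, ReadOnlySpan<char> literal)
    {
        ReadOnlySpan<char> tail = input.Slice(loopStart);
        int newline = tail.IndexOf('\n');
        ReadOnlySpan<char> region = newline >= 0 ? tail.Slice(0, newline) : tail;
        int offset = region.LastIndexOf(literal); // vectorized search
        return offset >= 0 ? loopStart + offset : -1;
    }
}
```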

Character class matching:

  • Improve handling of Unicode characters with negative sets. Consider a set like [^"]. This matches every character except a quote. Today we generate an ASCII bitmap for each set, but then for inputs >= 128, if it’s possible a Unicode character could match, we codegen a fallback that calls RegexCharClass.CharInClass. However, with a negation like this one, we know that not only might some Unicode characters match, but that every Unicode character is going to match. So, instead of generating a fallback that calls RegexCharClass.CharInClass, we can just generate a fallback that returns true. This will not only significantly speed up the handling of Unicode characters, it’ll also result in smaller code, which should help overall performance. For example, today we end up generating code along the lines of if ((uint)ch < 128) result = lookup[ch]; else result = CharInClass(ch); for a set like this, we can instead generate something like result = (uint)ch >= 128 || lookup[ch]; (a sketch follows this list).
  • Improve handling of sets with Unicode categories. Consider a set with a single Unicode category, like \d. Today we generate a lookup table for ASCII, and then fall back to calling CharInClass (in actuality, \d has special handling). We can instead just generate a call to char.GetUnicodeCategory and compare its result against the desired constant. We might not even need to generate the fast-path lookup table for this case, as GetUnicodeCategory already has a fast-path lookup table for ASCII, but we’d need to measure. Then consider a set like \w. It maps to multiple Unicode categories. Instead of having the fast-path lookup table and then a fallback that calls CharInClass, we could have the fallback be a call to char.GetUnicodeCategory, and codegen the comparison against the known Unicode categories we care about.
  • Explore using lookup tables larger than 128 entries. For example, would doubling the size to cover all characters < 256 help in enough additional cases to be worthwhile?
  • Built-in helper inlining. For known sets like \s, we now call out to a helper method, e.g. char.IsWhiteSpace. We should validate those calls are getting inlined.
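
As a concrete shape for the negated-set bullet, a sketch, with illustrative names, of the collapsed fallback for a set like [^"]:

```csharp
static class NegatedSetCheck
{
    // ASCII lookup bitmap for [^"]: true for every ASCII char except '"'.
    private static readonly bool[] s_lookup = CreateLookup();

    private static bool[] CreateLookup()
    {
        var lookup = new bool[128];
        for (int i = 0; i < 128; i++) lookup[i] = i != '"';
        return lookup;
    }

    // Today's shape: if ((uint)ch < 128) result = lookup[ch]; else result = CharInClass(ch);
    // With a negated set we know every non-ASCII char matches, so the fallback
    // collapses to 'true':
    public static bool MatchesNotQuote(char ch) => (uint)ch >= 128 || s_lookup[ch];
}
```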

General overhead:

  • Avoid two virtual calls plus two delegate calls per character. Each call to FindFirstChar is a virtual call, and it invokes the DynamicMethod via a delegate. If it matches, we then make a virtual call to Go, and it again calls the generated code via a delegate. That means if we frequently find a starting character and it doesn’t match the regex, we end up paying four virtual/delegate invocations for the character on top of the actual matching logic. We could change the codegen for FindFirstChar to also do the work of Go (even if it just calls Go, though better would be actually integrated); at that point Go can be a nop and would only be invoked by the base loop when there was actually a match.
  • Avoid duplicate work between FindFirstChar and Go. Similar / overlapping with the previous case, we currently duplicate work between FindFirstChar and Go in some cases. For example, a simple regex like hello will first validate the match in FindFirstChar and then redo the match in Go, even though we know it already matched. In such a case, we should be able to effectively make Go a nop.
  • Avoid unnecessary RegexWriter usage. We currently do the non-backtracking check after we’re already in the process of reflection emitting code. If we were to do it earlier, we could avoid using RegexWriter at all, since its result is just ignored if we take the non-backtracking path.
  • Reduce allocations specific to backtracking. RegexRunner initializes several int[]s for backtracking-related state, but if we end up employing the non-backtracking code generator, these are wasted. We can avoid creating them at all if they won't be used. While the runner gets reused for serial use, if a Regex instance is used concurrently (which happens implicitly when using the static Regex methods), threads that don't get the shared runner will end up creating their own temporary instance, and each of those will end up incurring these unnecessary allocations.
  • Better bounds-check elimination. We don't currently generate code in a way that will optimally eliminate bounds checks, e.g. we're not eliminating the bounds checks on the character class strings we output for fast lookups. Since we're doing code generation and spitting out IL directly, we could choose to just use IL-based opcodes to manipulate refs directly and avoid bounds checking altogether. (I tried eliminating all bounds checking and it made very little difference.)
  • Add a minimum length optimization. We already walk the RegexNode tree as part of the check to see whether we can use the non-backtracking implementation; as part of that (or separate from it) we can easily compute a minimum bound on the length of any matching string, e.g. foo[abc]{2,4} has a minimum length of 5. FindFirstChar can check upfront to see whether the input string is at least that long, and if it’s not, immediately fail. (A sketch follows this list.)
  • Replace when length == replacement length. We currently do an analysis to determine the minimum length a regex can match. It would not be hard to also do an analysis for the maximum length a regex can match. If those values are the same, then every match would be a specific length, and if that length is the same length as a string used in Regex.Replace as the replacement, we can be more efficient about how we do the replacement. We currently build up a list of segments (text/start/count) and then create a new string and copy everything into it. But if we know the length of the output string will be identical to the length of the input string, regardless of how many replacements happen (because replacements won't change the length), then we can create the string and do the replace directly into it, rather than building up the intermediate list of segments (also sketched after this list).
  • Replace falling back to string.Replace. We could analyze a regex to determine that it contains simple text (e.g. a single character or a "multi"), and in such cases for some of the overloads fall back to just calling string.Replace.
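
For the minimum-length bullet, here's a sketch of the computation over a toy node shape (the real code would walk RegexNode; every type here is hypothetical):

```csharp
using System.Linq;

// Toy node shapes, enough to demonstrate the walk.
abstract record Node;
record Literal(string Text) : Node;              // e.g. "foo"
record SetChar : Node;                           // one char from a class, e.g. [abc]
record Loop(Node Body, int Min) : Node;          // body{Min,...}
record Concat(params Node[] Children) : Node;
record Alternate(params Node[] Branches) : Node;

static class MinLengthAnalysis
{
    public static int MinLength(Node node) => node switch
    {
        Literal l => l.Text.Length,
        SetChar => 1,
        Loop loop => loop.Min * MinLength(loop.Body),
        Concat c => c.Children.Sum(MinLength),
        Alternate a => a.Branches.Min(MinLength),
        _ => 0,
    };
}

// foo[abc]{2,4} is Concat(Literal("foo"), Loop(new SetChar(), Min: 2)),
// giving 3 + 2 * 1 == 5, so FindFirstChar can immediately fail on any
// input shorter than 5 chars.
```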
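
And for same-length replacement, a sketch of allocating the result once with string.Create and patching replacements directly into it; the match positions are assumed to come from elsewhere, and all names are illustrative:

```csharp
using System;

static class InPlaceReplace
{
    // When every match has the same length as the replacement, the output
    // length equals the input length, so we can allocate once and patch,
    // rather than building up an intermediate list of segments.
    public static string Replace(string input, (int Index, int Length)[] matches, string replacement)
    {
        return string.Create(input.Length, (input, matches, replacement), static (dest, state) =>
        {
            state.input.AsSpan().CopyTo(dest);
            foreach (var m in state.matches) // m.Length == replacement.Length by construction
                state.replacement.AsSpan().CopyTo(dest.Slice(m.Index, m.Length));
        });
    }
}
```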

Fundamental engine changes / replacements:

  • Employ a form of tiered compilation. If someone doesn't specify RegexOptions.Compiled, we could consider tracking usage of the Regex instance, and when it hits a certain threshold, asynchronously upgrade it to being compiled. We'd queue the compilation of the expression, and then backpatch the instance with the upgraded runner, such that all future invocations would use the compiled version instead. The upside to this is faster execution when RegexOptions.Compiled should have been specified but wasn't. There are a few potential downsides. We'd need to always carry around the compiler in a trimmed app, even if the compiler wouldn't otherwise be used. We'd potentially waste time/space compiling a regex that wasn't actually important. And we'd risk subtle discrepancies that show up over time if we ever have a bug that results in different execution semantics between non-compiled and compiled. (A sketch follows this list.)
  • Try to unify the multiple Go method code generators. I added the non-backtracking implementation because lots of regexes don’t actually need the backtracking, and it adds a non-trivial amount of overhead. We might, however, be able to be smarter about how we implement things, and make it pay-for-play, with a single implementation that only adds in the backtracking code if a construct requires it. We partially have that with the two implementations today, where we check to see if we can use one and then fall back to the other, so the goal here would be to see if we can a) minimize some of the duplication, and b) maintain the non-backtracking codegen benefits in more cases.
  • PCRE light-up. Just as we analyze the RegexNode tree to see whether we can use our non-backtracking implementation, we could analyze it to see whether we could use PCRE, and do so if it’s dynamically found on the machine. The .NET APIs support features PCRE doesn’t, so it would only work if a subset of functionality was required.
  • Consider using memoization to reduce backtracking costs. This can help avoid the huge costs of backtracking, at the expense of significant memory consumption. Some implementations balance this by having a max cache size and evicting from the cache when it's full and new states are added. Essentially, for each backtracking construct in the expression (e.g. a loop), we track all positions in the input string for which no match was found, and then before backtracking we check to see if we've tried that position before. Worst case, this is O(expression_length * input_length) space, but optimizations are possible on top of that, e.g. storing ranges instead of all individual items, only storing the X results that are most impactful, etc. (Also sketched after this list.)
  • Consider using a DFA implementation for some regexes. Other implementations do this, with multiple engines employed based on the target expression.
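
As a sketch of the tiering idea, with all names hypothetical and the real integration point (RegexRunner creation) elided:

```csharp
using System.Threading;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

sealed class TieredRegex
{
    private const int Threshold = 100;   // illustrative upgrade point
    private Regex _regex;                // backpatched once compilation finishes
    private int _useCount;
    private readonly string _pattern;

    public TieredRegex(string pattern)
    {
        _pattern = pattern;
        _regex = new Regex(pattern);     // interpreted to start
    }

    public bool IsMatch(string input)
    {
        // After enough uses, queue a background upgrade to the compiled engine.
        if (Interlocked.Increment(ref _useCount) == Threshold)
        {
            Task.Run(() =>
            {
                var compiled = new Regex(_pattern, RegexOptions.Compiled);
                Volatile.Write(ref _regex, compiled); // backpatch
            });
        }
        return _regex.IsMatch(input);
    }
}
```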
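
And a sketch of the memoization idea: a bounded set of (construct, position) pairs known to fail, consulted before re-exploring; the types are hypothetical:

```csharp
using System.Collections.Generic;

sealed class BacktrackingMemo
{
    private const int MaxEntries = 1 << 16;            // bound memory use
    private readonly HashSet<(int NodeId, int Pos)> _failed = new();

    // Record that the construct identified by nodeId found no match at pos.
    public void MarkFailed(int nodeId, int pos)
    {
        if (_failed.Count >= MaxEntries) _failed.Clear(); // crude eviction
        _failed.Add((nodeId, pos));
    }

    // Before backtracking into nodeId at pos, skip it if it's known to fail.
    public bool KnownToFail(int nodeId, int pos) => _failed.Contains((nodeId, pos));
}
```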

cc: @danmosemsft, @eerhardt, @ViktorHofer, @davidwrighton
