-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Description
Prior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc.
To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality.
However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:
- Greedy, non-atomic node loops (RegexNode.Loop). Lazy loops are already supported, as are all forms of loops around a single character ("one", "notone", "set"), and all atomic loops. (Add backtracking loops, backreferences, and if-then-else constructs to Regex "simplified" code gen #61906)
- Backreferences (RegexNode.Ref) (Add backtracking loops, backreferences, and if-then-else constructs to Regex "simplified" code gen #61906)
- Backreference conditionals (RegexNode.Testref) (Add backtracking loops, backreferences, and if-then-else constructs to Regex "simplified" code gen #61906)
- Expression conditionals (RegexNode.Testgroup) (Add backtracking loops, backreferences, and if-then-else constructs to Regex "simplified" code gen #61906)
- Balancing groups. (Add backtracking loops, backreferences, and if-then-else constructs to Regex "simplified" code gen #61906)
- Alternations that aren't top-level or atomic.
- Loops in all locations. They're currently not supported inside other backtracking constructs.
- Captures in all locations. Currently only some locations for captures are supported.
- RegexOptions.RightToLeft
- Lookbehinds (positive and negative)
We're on track to have all but the last two (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking.
However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached new Regex(...)
instance. This might represent a perf regression for that very small subset of use cases. However, a big benefit would be not having to fully implement RightToLeft in the new codegen strategy, getting to delete several thousand lines of difficult code from RegexCompiler and several thousand lines of difficult code from the source generator, and we could also stop running RegexWriter at all for RegexCompiler and the source generator. And we could mark various protected members of RegexRunner as obsolete and consider removing them in the future.