Description
This is related to #59492, and each affects the other, but they are distinct issues.
Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at match time it lazily lowercases everything in the input as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how the two cultures case characters. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. The current approach also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case.
Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character by character, calling ToLower on each and comparing the result against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). And even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid characters, which also means we can avoid having to query CultureInfo.CurrentCulture.
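To make the difference concrete, here is a rough, language-neutral sketch in Python (not the actual Regex code). The function names are hypothetical, and Python's `str.lower` is culture-independent, so this only illustrates the shape of the two search strategies, not .NET's casing behavior.

```python
def find_first_by_lowering(text: str, target_lower: str) -> int:
    """Current shape: lower each input character, then compare."""
    for i, ch in enumerate(text):
        if ch.lower() == target_lower:  # one casing call per character
            return i
    return -1


def find_first_in_set(text: str, equivalents: tuple) -> int:
    """Proposed shape: search for any member of a precomputed set.

    str.find stands in for a vectorized IndexOfAny; no casing call
    happens at match time.
    """
    hits = [p for p in (text.find(c) for c in equivalents) if p != -1]
    return min(hits, default=-1)


if __name__ == "__main__":
    print(find_first_by_lowering("xyzAbc", "a"))
    print(find_first_in_set("xyzAbc", ("A", "a")))
```

Both functions locate the same position; the second one simply moves all casing work out of the matching loop.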
We should consider overhauling the scheme employed:
- The only culture that matters is the one present when the Regex is constructed (or InvariantCulture if RegexOptions.CultureInvariant is specified). All (rather than just some) casing decisions are made then. This would also effectively answer the primary question in #59492, "[API Proposal] Add cultureName constructors to GeneratedRegex" (we'd still need to make a decision on what to do for the source generator).
- When the regex is constructed and the pattern analyzed, rather than creating sets that contain the original character and its lowercased version, we create sets that contain that character and any others that should be considered equal according to casing rules.
- We stop using ToLower everywhere else in the codebase.
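A minimal sketch of that scheme, in Python for illustration only: every casing decision is made in the constructor, and matching is pure set membership. The class `IgnoreCaseLiteral` and both `*_equivalents` functions are hypothetical; the Turkish-like table is hand-written for the dotted/dotless I pairs and is not derived from real culture data.

```python
def default_equivalents(ch: str) -> frozenset:
    """Assumed culture-independent table: the character plus its
    single-character lower/upper forms (multi-character expansions
    are dropped)."""
    return frozenset(c for c in (ch, ch.lower(), ch.upper()) if len(c) == 1)


def turkish_like_equivalents(ch: str) -> frozenset:
    """Hypothetical table in which i pairs with İ and ı pairs with I."""
    pairs = {"i": "\u0130", "\u0130": "i", "\u0131": "I", "I": "\u0131"}
    if ch in pairs:
        return frozenset({ch, pairs[ch]})
    return default_equivalents(ch)


class IgnoreCaseLiteral:
    def __init__(self, literal: str, equivalents_for=default_equivalents):
        # All casing decisions are made here, once, at construction.
        self._sets = tuple(equivalents_for(ch) for ch in literal)

    def is_match_at(self, text: str, start: int = 0) -> bool:
        # Match time: no casing calls and no culture lookup.
        if start + len(self._sets) > len(text):
            return False
        return all(text[start + i] in s for i, s in enumerate(self._sets))


if __name__ == "__main__":
    print(IgnoreCaseLiteral("file").is_match_at("FILE"))
    print(IgnoreCaseLiteral("file", turkish_like_equivalents).is_match_at("FILE"))
```

Because the table is captured at construction, changing the ambient culture afterwards cannot change what the matcher accepts.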
Note that the new NonBacktracking engine does something similar, which means a) we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I.
This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment).
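As a rough illustration of the "in the extreme" approach and of range handling, the following Python sketch groups characters by their uppercase form and then compresses a set into inclusive ranges. It relies on Python's culture-independent `str.upper`, skips multi-character expansions, and uses hypothetical helper names; it is not how Regex or NonBacktracking computes its tables, and in practice such a table would presumably be precomputed rather than built at run time.

```python
from collections import defaultdict


def build_equivalence_classes(max_code_point: int = 0xFFFF) -> dict:
    """Map each character to the set of characters sharing its uppercase form."""
    groups = defaultdict(set)
    for cp in range(max_code_point + 1):
        ch = chr(cp)
        up = ch.upper()
        if len(up) == 1:  # ignore expansions such as one char -> two chars
            groups[up].add(ch)
    return {
        ch: frozenset(members)
        for members in groups.values()
        if len(members) > 1  # characters with no case partner get no entry
        for ch in members
    }


def to_ranges(chars) -> list:
    """Compress characters into sorted inclusive (low, high) ranges."""
    ranges = []
    for cp in sorted({ord(c) for c in chars}):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current range
        else:
            ranges.append([cp, cp])  # start a new range
    return [(chr(lo), chr(hi)) for lo, hi in ranges]


if __name__ == "__main__":
    classes = build_equivalence_classes(0x7F)
    print(sorted(classes["a"]))
    print(to_ranges("abcxz"))
```

Scanning the whole code-point range like this is exactly the kind of overhead the paragraph above warns against, which is why the sketch is only useful as a description of the intended result.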
cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley