Skip to content

[API Proposal] Add cultureName constructors to GeneratedRegex #59492

@stephentoub

Description

@stephentoub

This issue has now been turned into an API proposal for the missing GeneratedRegex constructors that take in a cultureName. The previous description of the issue has been moved bellow

Background and motivation

Regex engines are able to do case insensitive searches when you use RegexOptions.IgnoreCase or when you use a pattern that signals that it is case-insensitive, like: (?i)abc. For better performance, our engines translate the passed in patterns into their case-sensitive equivalence so that we can benefit of some optimizations like using vectorized search, for example, a pattern (?i)abc will be translated into the equivalent [Aa][Bb][Cc].

Case conversions are not standard across all cultures, which means that when making that translation, it matters a lot which culture is used in order to produce the translation. For most of our engines (interpreted, Compiled, nonBacktracking) this translation happens at runtime, and will use the CultureInfo.CurrentCulture at runtime in order to perform this translation. The remaining engine, however, is different. Because it is the source generated engine, it means that this translation will happen at build-time of the application, which means that the culture won't necessarily be the same as the one you have at runtime. Today, the source generated engine will use the Culture at build-time to perform the translation, and there is no way to select a different culture, which is problematic, since even if you are able to change the culture of your machine if you are building the application, this won't be possible if the code gets built in a build server. This proposal is to add an extra string parameter to the GeneratedRegex attribute so that you can optionally select which culture should be used to perform the translation.

API Proposal

[AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
public sealed class GeneratedRegexAttribute : System.Attribute
{
    public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { }
    public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options) { }
+   public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, string cultureName) { }
    public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, int matchTimeoutMilliseconds) { }
+   public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, int matchTimeoutMilliseconds, string cultureName) { }

+   public string CultureName { get; }
}

API Usage

public partial class Foo
{
    [GeneratedRegex(@"(?i)abc", RegexOptions.None, "en-US")]
    public static partial Regex ThreeParameterRegex();

    [GeneratedRegex(@"abc", RegexOptions.IgnoreCase, /*matchTimeoutInMilliseconds*/ 5000, "en-US")]
    public static partial Regex FourParameterRegex();
}

Draft PR with the implementation

#73490

Original Issue Description

If a developer specifies RegexOptions.IgnoreCase to a regular expression (or uses the inline ignore casing option in the expression), the implementation(s) then need to do casing operations to determine whether things match. That's implemented today in two parts:

  1. At construction time, values are lowercased in the pattern.
  2. At match time, the text being compared is lowercased.

If the developer specifies RegexOptions.CultureInvariant, that's fine, as both (1) and (2) will be done with CultureInfo.InvariantCulture. But if CultureInvariant isn't specified, it's possible that the CultureInfo.CurrentCulture at the time of (1) will differ from the time of (2), in which case the lowercasing can be performed with different cultures, leading to problems. Note that this really only affects the instance methods on Regex. The static methods on Regex access a cache of Regex instances, and the culture used to create the Regex instances is included in the key to that cache. Also note that the docs talk about the current culture being used, but they don't specify current "when".

We have a few options to fix this:

  1. Keep the current discrepancy. Using a different culture to match from the culture when the Regex was created is buggy.
  2. Use the culture that was present at the time of construction. This would mean the Regex would cache that CultureInfo instance (or TextInfo) instance, and use it for match operations rather than having those match operations each consult CultureInfo.CurrentCulture. (This could even have a small positive impact on perf, as it means we'd be accessing a field rather than a thread static on each match operation.)
  3. Use the culture that's present at the time of match. This means we couldn't do at construction time many of the optimizations that are currently done, and we'd need to come up with some caching scheme where we lazily compute the necessary data and find the right set of data to use at match time. In the extreme this could include maintaining a cache per culture of all the regex's state and then finding the right state to use as part of the match, essentially a limited version of what we do in the static methods. There is very likely to be a negative perf impact from this.

We should investigate whether the discrepancy actually matters, e.g. in practice, do all regexes where this could matter:

  • not use IgnoreCase
  • use InvariantCulture
  • only ever access it from the same culture in which it was constructed or where casing rules are the same as the culture in which it was constructed
  • have a pattern that's not susceptible to issues (e.g. is)

My gut is we should do (2), primarily as it's explainable and simple. We're already doing some such computation at construction time, so it's unlikely a robust dependency could have been taken on the use of the culture only at match time.

If we do (2), there are then additional complications for the regex source generator. All the computation that normally happens in the Regex constructor happens at build time with the regex source generator, which means CultureInfo.CurrentCulture is determined at the time of build, whatever the culture is for the process doing the build / generating the source code. We have a few options there:

  1. Keep that design, using whatever the current culture is at build. That's probably not ideal.
  2. Always use invariant culture in the source generator, regardless of whether RegexOptions.CultureInvariant is specified.
  3. Fail to generate code if casing might be required (either RegexOptions.IgnoreCase or the inline i option is specified somewhere in the expression) and RegexOptions.CultureInvariant wasn't specified
  4. Add overloads to the RegexGenerator attribute constructor that accept a string cultureName, which is then used. Either get rid of the other overloads, or have them do one of the other options.

I'm tempted to say we should do (4), and if you use a ctor that doesn't take a culture, fall back to (3).

cc: @GrabYourPitchforks, @tarekgh

Related:
#36147
#58958

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions