Regex isn't factoring target culture into lowercasing of ranges #36147

@stephentoub

Description

e.g.

using System;
using System.Globalization;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        var r = new Regex(@"[A-Z]", RegexOptions.IgnoreCase);
        Console.WriteLine(r.IsMatch("\u0131")); // should print true, but prints false
    }
}

In Turkish, I lowercases to ı (\u0131), so the above repro should print true. Regex does use the target culture when lowercasing an individual character in a set:

    SingleRange range = rangeList[i];
    if (range.First == range.Last)
    {
        char lower = culture.TextInfo.ToLower(range.First);
        rangeList[i] = new SingleRange(lower, lower);
    }

But when the set instead contains a range spanning multiple characters, it delegates to the AddLowercaseRange function, which doesn't factor the target culture into its decision and instead uses a precomputed table:

    private static readonly LowerCaseMapping[] s_lcTable = new LowerCaseMapping[]

@tarekgh, @GrabYourPitchforks, am I correct that such a table couldn't possibly be right, given that different cultures case differently?
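
To make the concern concrete, here is a minimal sketch that lowercases every character of a range with a given culture, the same way the single-character path does, and reports where the invariant culture and tr-TR disagree. `CultureLoweringDemo` and `LowerRange` are hypothetical names used only for illustration; they are not part of Regex. It assumes culture data is available (i.e. the process is not running in invariant-globalization mode).

```csharp
using System;
using System.Globalization;

public static class CultureLoweringDemo
{
    // Hypothetical helper: lowercases each character in [first, last]
    // using the supplied culture's TextInfo, one character at a time.
    public static char[] LowerRange(char first, char last, CultureInfo culture)
    {
        var result = new char[last - first + 1];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = culture.TextInfo.ToLower((char)(first + i));
        }
        return result;
    }

    public static void Main()
    {
        char[] invariant = LowerRange('A', 'Z', CultureInfo.InvariantCulture);
        char[] turkish = LowerRange('A', 'Z', new CultureInfo("tr-TR"));

        // Print every uppercase letter whose lowercase form depends on the culture.
        for (int i = 0; i < invariant.Length; i++)
        {
            if (invariant[i] != turkish[i])
            {
                Console.WriteLine(
                    $"{(char)('A' + i)}: invariant U+{(int)invariant[i]:X4}, tr-TR U+{(int)turkish[i]:X4}");
            }
        }
    }
}
```

Any line this prints is a character for which a single culture-independent table cannot produce the right answer for both cultures.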

Note that if the above repro is instead changed to spell out the whole range of uppercase letters:

using System;
using System.Globalization;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        var r = new Regex(@"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]", RegexOptions.IgnoreCase);
        Console.WriteLine(r.IsMatch("\u0131")); // prints true
    }
}

it then correctly prints true.

cc: @eerhardt, @pgovind
