Make enum RegexParseError and RegexParseException public #38872

Closed
@abelbraaksma


Background and Motivation

A regular expression built with System.Text.RegularExpressions.Regex is essentially an ad-hoc compiled sub-language that is widely used in the .NET community for searching and replacing strings. But unlike in other programming languages, every syntax error is raised as a plain ArgumentException. Programmers who want to act on specific parsing errors have to parse the error message themselves to get more information, which is error-prone, subject to change and sometimes non-deterministic.

We already have an internal RegexParseException with two properties, Error and Offset, which respectively give an enum value for the type of error and the position in the pattern string where the error occurred. When an ArgumentException is raised today, it is in fact a RegexParseException, which inherits from ArgumentException.

I've checked the existing code and propose that we make RegexParseException and RegexParseError public. Both are fairly self-describing as they stand, though the enum members could use better names (suggested below). Apart from changing a few existing tests and adding documentation, no substantive changes are necessary.

Use cases

  • Online regex tools may use the more detailed info to suggest corrections of the regex to users (like: "Did you forget to escape this character?").
  • Debugging experience w.r.t. regular expressions improves.
  • Currently, getting the column and row requires parsing the message string, and on .NET Framework that information isn't in the string. Parsing is next to impossible in i18n scenarios; an enum and a position help in writing better, deterministic code.
  • Improve tooling by using the offset to place squiggles under errors in regexes.
  • Self-correcting systems may use the extra info to close parentheses or brackets, or fix escape sequences that are incomplete.
  • It is simply better to be able to check for explicit errors than the more generic ArgumentException which is used everywhere.
  • BCL tests on regex errors currently use reflection; this would no longer be necessary.
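
As a rough illustration of the tooling use cases above, here is a minimal sketch that marks the reported position under a pattern. It assumes the proposal is accepted as written, so that RegexParseException is public and exposes Offset. The RegexErrorDisplay class and its Underline and Describe methods are hypothetical names made up for this example, and the exact position the parser reports for a given pattern is not something this sketch relies on.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexErrorDisplay
{
    // Pure helper: returns the pattern followed by a second line with a caret at 'offset'.
    public static string Underline(string pattern, int offset)
    {
        // Clamp so an out-of-range offset cannot produce a negative padding length.
        int column = Math.Max(0, Math.Min(offset, pattern.Length));
        return pattern + Environment.NewLine + new string(' ', column) + "^";
    }

    // Returns null for a valid pattern, otherwise the pattern with a caret line.
    // Assumption: RegexParseException is public and has an Offset property, as proposed.
    public static string Describe(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return null;
        }
        catch (RegexParseException ex)
        {
            return Underline(pattern, ex.Offset);
        }
    }
}
```

An editor or online tool could use the same offset to draw a squiggle instead of a caret line; the clamping is only there to guard against positions outside the pattern.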

Related requests and proposals

Proposed API

The current API already exists but isn't public. The definitions are as follows:

    [Serializable]
-    internal sealed class RegexParseException : ArgumentException
+    public class RegexParseException : ArgumentException
    {
        private readonly RegexParseError _error; // tests access this via private reflection

        /// <summary>Gets the error that happened during parsing.</summary>
        public RegexParseError Error => _error;

        /// <summary>Gets the offset in the supplied pattern.</summary>
        public int Offset { get; }

        public RegexParseException(RegexParseError error, int offset, string message) : base(message)
        {
+            // add logic to test range of 'error' and return UnknownParseError if out of range
            _error = error;
            Offset = offset;
        }

        public override void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            base.GetObjectData(info, context);
            info.SetType(typeof(ArgumentException)); // To maintain serialization support with .NET Framework.
        }
    }

And the enum with suggested names for a more discoverable naming scheme. I followed "clarity over brevity" and have tried to start similar cases with the same moniker, so that an alphabetic listing gives a (somewhat) logical grouping in tooling.

I'd suggest we add a case for unknown conditions, something like UnknownParseError = 0, which could be used if users create this exception by hand with an invalid enum value.
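
The range check hinted at in the constructor comment could look roughly like the sketch below. It is only one possible approach, using Enum.IsDefined. DemoParseError is a cut-down stand-in for the proposed enum and ParseErrorGuard.Normalize is a hypothetical helper; neither is part of the proposed API.

```csharp
using System;

// Stand-in with a handful of the proposed members; not the real RegexParseError.
public enum DemoParseError
{
    UnknownParseError = 0,
    AlternationHasComment,
    UnterminatedBracket,
}

public static class ParseErrorGuard
{
    // Maps any value that is not a declared member to UnknownParseError.
    public static DemoParseError Normalize(DemoParseError error) =>
        Enum.IsDefined(typeof(DemoParseError), error)
            ? error
            : DemoParseError.UnknownParseError;
}
```

With this in place, a hand-constructed exception carrying a value such as (DemoParseError)999 would surface as UnknownParseError rather than as an undefined enum value.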

Handy for implementers: the historical view of this issue prior to 22 July 2020 shows the full diff for the enum, field by field. On request, the enum is now shown entirely as an addition diff and is ordered alphabetically.

-internal enum RegexParseError
+public enum RegexParseError
{
+    UnknownParseError = 0,    // do we want to add this catch all in case other conditions emerge?
+    AlternationHasComment,
+    AlternationHasMalformedCondition,  // *maybe? No tests, code never hits
+    AlternationHasMalformedReference,  // like @"(x)(?(3x|y)" (note that @"(x)(?(3)x|y)" gives next error)
+    AlternationHasNamedCapture,        // like @"(?(?<x>)true|false)"
+    AlternationHasTooManyConditions,   // like @"(?(foo)a|b|c)"
+    AlternationHasUndefinedReference,  // like @"(x)(?(3)x|y)" or @"(?(1))"
+    CaptureGroupNameInvalid,           // like @"(?< >)" or @"(?'x)"
+    CaptureGroupOfZero,                // like @"(?'0'foo)" or @"(?<0>x)"
+    ExclusionGroupNotLast,             // like @"[a-z-[xy]A]"
+    InsufficientClosingParentheses,    // like @"(((foo))"
+    InsufficientOpeningParentheses,    // like @"((foo)))"
+    InsufficientOrInvalidHexDigits,    // like @"\uabc" or @"\xr"
+    InvalidGroupingConstruct,          // like @"(?" or @"(?<foo"
+    InvalidUnicodePropertyEscape,      // like @"\p{Ll" or @"\p{ L}"
+    MalformedNamedReference,           // like @"\k<"
+    MalformedUnicodePropertyEscape,    // like @"\p{}" or @"\p {L}"
+    MissingControlCharacter,           // like @"\c"
+    NestedQuantifiersNotParenthesized, // like @"abc**"
+    QuantifierAfterNothing,            // like @"((*foo)bar)"
+    QuantifierOrCaptureGroupOutOfRange,// like @"x{234567899988}" or @"x(?<234567899988>)" (must be < Int32.MaxValue)
+    ReversedCharacterRange,            // like @"[z-a]"   (only in char classes, see also ReversedQuantifierRange)
+    ReversedQuantifierRange,           // like @"abc{3,0}"  (only in quantifiers, see also ReversedCharacterRange)
+    ShorthandClassInCharacterRange,    // like @"[a-\w]" or @"[a-\p{L}]"
+    UndefinedNamedReference,           // like @"\k<x>"
+    UndefinedNumberedReference,        // like @"(x)\2"
+    UnescapedEndingBackslash,          // like @"foo\" or @"bar\\\\\"
+    UnrecognizedControlCharacter,      // like @"\c!"
+    UnrecognizedEscape,                // like @"\C" or @"\k<" or @"[\B]"
+    UnrecognizedUnicodeProperty,       // like @"\p{Lll}"
+    UnterminatedBracket,               // like @"[a-b"
+    UnterminatedComment,
}

* About AlternationHasMalformedCondition (currently named IllegalCondition): this is thrown inside a conditional alternation like (?(foo)x|y), but appears never to be hit. There is no test case covering this error.

Usage Examples

Here's an example where we use the additional info to give more detailed feedback to the user:

using System;
using System.Text.RegularExpressions;

public class TestRE
{
    // Returns the compiled regex, or null (after logging the reason) when the pattern is invalid.
    public static Regex CreateAndLog(string regex)
    {
        try
        {
            return new Regex(regex);
        }
        catch (RegexParseException reExc)
        {
            // Member names follow the naming scheme proposed above.
            switch (reExc.Error)
            {
                case RegexParseError.InsufficientOrInvalidHexDigits:
                    Console.WriteLine("The hexadecimal escape does not contain enough hex characters.");
                    break;
                case RegexParseError.UndefinedNumberedReference:
                    Console.WriteLine("Back-reference at position {0} does not match any captures.", reExc.Offset);
                    break;
                case RegexParseError.UnrecognizedUnicodeProperty:
                    Console.WriteLine("Error at {0}. Unicode properties must exist, see http://aka.ms/xxx for a list of allowed properties.", reExc.Offset);
                    break;
                // ... etc
            }
            return null;
        }
    }
}

Alternative Designs

Alternatively, we may remove the type entirely and merely throw an ArgumentException. But it is likely that some people rely on the internal type even though it isn't public, since the contextual information can be reached through reflection and is probably used in regex libraries. Besides, removing it would make any future improvements in handling parsing errors and proposing fixes in GUIs much harder.

Risks

The only risk I can think of is that after exposing this exception, people would like even more details. But that's probably a good thing and only improves the existing API.

Note that:

  • Existing code that checks for ArgumentException continues to work.
  • While debugging, people who see the underlying exception type can now actually use it.
  • Existing code that uses reflection to get at the extra data may or may not continue to work, depending on how strictly it searches for the type.
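
To make the last point concrete, the sketch below shows one loose style of reflection lookup that should not care whether the declaring type is internal or public, because it matches on the simple type name and a public instance property only. ReflectionLookup.TryGetOffset is a hypothetical helper written for this example, not code taken from any existing library. Stricter lookups, for instance ones that also require the type to be non-public, are the kind that could stop working.

```csharp
using System;
using System.Reflection;

public static class ReflectionLookup
{
    // Reads Offset by simple type name and public property name only.
    // Returns null when the exception is not a RegexParseException.
    public static int? TryGetOffset(ArgumentException ex)
    {
        Type type = ex.GetType();
        if (type.Name != "RegexParseException")
            return null;

        // Public instance properties are found regardless of the type's own visibility.
        PropertyInfo property = type.GetProperty("Offset", BindingFlags.Public | BindingFlags.Instance);
        return property?.GetValue(ex) as int?;
    }
}
```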

[danmose: made some more edits]
