Description
Background and Motivation
A regular expression object made with System.Text.RegularExpressions.Regex is essentially an ad-hoc compiled sub-language that's widely used in the .NET community for searching and replacing strings. But unlike in other programming languages, any syntax error is raised as an ArgumentException. Programmers who want to act on specific parsing errors need to manually parse the error string to get more information, which is error-prone, subject to change and sometimes non-deterministic.
We already have an internal RegexParseException with two properties, Error and Offset, which respectively give an enum value for the type of error and the location in the pattern string where the error occurs. When an ArgumentException is raised today, it is in fact a RegexParseException, which inherits from ArgumentException.
I've checked the existing code and I propose we make RegexParseException and RegexParseError public. These are pretty self-describing at the moment, though the enum cases may need better names (suggested below). Apart from changing a few existing tests and adding documentation, no substantive changes are necessary.
Use cases
- Online regex tools may use the more detailed info to suggest corrections of the regex to users (like: "Did you forget to escape this character?").
- Debugging experience w.r.t. regular expressions improves.
- Currently, getting the column and row requires parsing the message string, and on .NET Framework the position isn't in the string at all. Parsing the message in i18n scenarios is next to impossible; an enum and a position make it possible to write better, deterministic code.
- Improve tooling by using the offset to place squiggles under errors in regexes.
- Self-correcting systems may use the extra info to close parentheses or brackets, or fix escape sequences that are incomplete.
- It is simply better to be able to check for explicit errors than for the more generic ArgumentException, which is used everywhere.
- BCL tests on regex errors currently use reflection; this would no longer be necessary.
Related requests and proposals
- In this comment (Regex should provide a validation method #13942 (comment)) @terrajobst first proposed to open-up this exception for improving error handling.
- Original proposal by @danmosemsft: Make RegexParseException public #1080
- Tooling improvements with squiggles: Porting roslyn regex tests to corefx corefx#29178 (comment) and Make RegexParseException public #372 (comment)
Proposed API
The current API already exists but isn't public. The definitions are as follows:
[Serializable]
- internal sealed class RegexParseException : ArgumentException
+ public class RegexParseException : ArgumentException
{
private readonly RegexParseError _error; // tests access this via private reflection
/// <summary>Gets the error that happened during parsing.</summary>
public RegexParseError Error => _error;
/// <summary>Gets the offset in the supplied pattern.</summary>
public int Offset { get; }
public RegexParseException(RegexParseError error, int offset, string message) : base(message)
{
+ // add logic to test range of 'error' and return UnknownParseError if out of range
_error = error;
Offset = offset;
}
public override void GetObjectData(SerializationInfo info, StreamingContext context)
{
base.GetObjectData(info, context);
info.SetType(typeof(ArgumentException)); // To maintain serialization support with .NET Framework.
}
}
And the enum, with suggested names for a more discoverable naming scheme. I followed "clarity over brevity" and have tried to start similar cases with the same moniker, so that an alphabetic listing gives a (somewhat) logical grouping in tooling.
I'd suggest we add a case for unknown conditions, something like UnknownParseError = 0, which could be used if users create this exception by hand with an invalid enum value.
Handy for implementers: Historical view of this prior to 22 July 2020 shows the full diff for the enum field by field. On request, it shows all as an addition diff now, and is ordered alphabetically.
-internal enum RegexParseError
+public enum RegexParseError
{
+ UnknownParseError = 0, // do we want to add this catch all in case other conditions emerge?
+ AlternationHasComment,
+ AlternationHasMalformedCondition, // *maybe? No tests, code never hits
+ AlternationHasMalformedReference, // like @"(x)(?(3x|y)" (note that @"(x)(?(3)x|y)" gives next error)
+ AlternationHasNamedCapture, // like @"(?(?<x>)true|false)"
+ AlternationHasTooManyConditions, // like @"(?(foo)a|b|c)"
+ AlternationHasUndefinedReference, // like @"(x)(?(3)x|y)" or @"(?(1))"
+ CaptureGroupNameInvalid, // like @"(?< >)" or @"(?'x)"
+ CaptureGroupOfZero, // like @"(?'0'foo)" or @"(?<0>x)"
+ ExclusionGroupNotLast, // like @"[a-z-[xy]A]"
+ InsufficientClosingParentheses, // like @"(((foo))"
+ InsufficientOpeningParentheses, // like @"((foo)))"
+ InsufficientOrInvalidHexDigits, // like @"\uabc" or @"\xr"
+ InvalidGroupingConstruct, // like @"(?" or @"(?<foo"
+ InvalidUnicodePropertyEscape, // like @"\p{Ll" or @"\p{ L}"
+ MalformedNamedReference, // like @"\k<"
+ MalformedUnicodePropertyEscape, // like @"\p{}" or @"\p {L}"
+ MissingControlCharacter, // like @"\c"
+ NestedQuantifiersNotParenthesized, // like @"abc**"
+ QuantifierAfterNothing, // like @"((*foo)bar)"
+ QuantifierOrCaptureGroupOutOfRange,// like @"x{234567899988}" or @"x(?<234567899988>)" (must be < Int32.MaxValue)
+ ReversedCharacterRange, // like @"[z-a]" (only in char classes, see also ReversedQuantifierRange)
+ ReversedQuantifierRange, // like @"abc{3,0}" (only in quantifiers, see also ReversedCharacterRange)
+ ShorthandClassInCharacterRange, // like @"[a-\w]" or @"[a-\p{L}]"
+ UndefinedNamedReference, // like @"\k<x>"
+ UndefinedNumberedReference, // like @"(x)\2"
+ UnescapedEndingBackslash, // like @"foo\" or @"bar\\\\\"
+ UnrecognizedControlCharacter, // like @"\c!"
+ UnrecognizedEscape, // like @"\C" or @"\k<" or @"[\B]"
+ UnrecognizedUnicodeProperty, // like @"\p{Lll}"
+ UnterminatedBracket, // like @"[a-b"
+ UnterminatedComment,
}
* About IllegalCondition (the case marked * in the list above, proposed as AlternationHasMalformedCondition): this is thrown inside a conditional alternation like (?(foo)x|y), but it appears never to be hit. There is no test case covering this error.
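The range check suggested in the constructor comment could look roughly like the sketch below. It is only an illustration: the enum here is a three-case stand-in for the proposed RegexParseError, and the Normalize helper is a hypothetical name, not part of the proposal.

```csharp
using System;

// Stand-in with only a few cases; the real enum is the one proposed above.
public enum RegexParseError
{
    UnknownParseError = 0,
    AlternationHasComment,
    UnterminatedBracket,
}

public static class RegexParseErrorValidation
{
    // Hypothetical helper: maps any value that is not a declared case
    // to UnknownParseError, and leaves declared cases untouched.
    public static RegexParseError Normalize(RegexParseError error) =>
        Enum.IsDefined(typeof(RegexParseError), error)
            ? error
            : RegexParseError.UnknownParseError;
}
```

With something like this in the constructor, `new RegexParseException((RegexParseError)999, 0, "...")` would report UnknownParseError instead of an undefined value.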
Usage Examples
Here's an example where we use the additional info to give more detailed feedback to the user:
using System;
using System.Text.RegularExpressions;

public class TestRE
{
    // Creates the Regex, or logs a targeted message and returns null when the pattern is invalid.
    public static Regex CreateAndLog(string regex)
    {
        try
        {
            return new Regex(regex);
        }
        catch (RegexParseException reExc)
        {
            // Case names follow the enum proposed above.
            switch (reExc.Error)
            {
                case RegexParseError.InsufficientOrInvalidHexDigits:
                    Console.WriteLine("The hexadecimal escape does not contain enough hex digits.");
                    break;
                case RegexParseError.UndefinedNumberedReference:
                    Console.WriteLine("Back-reference in position {0} does not match any captures.", reExc.Offset);
                    break;
                case RegexParseError.UnrecognizedUnicodeProperty:
                    Console.WriteLine("Error at {0}. Unicode properties must exist, see http://aka.ms/xxx for a list of allowed properties.", reExc.Offset);
                    break;
                // ... etc
            }
            return null;
        }
    }
}
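For the tooling use case (squiggles under errors), a second sketch shows how Offset might be turned into a caret marker under the pattern. It assumes that Offset is a zero-based index into the pattern string, which the proposal does not spell out, and the names RegexErrorMarker, Mark and TryCreate are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexErrorMarker
{
    // Hypothetical helper: returns the pattern followed by a line with a caret at 'offset'.
    // Assumes 'offset' is a zero-based index into 'pattern'.
    public static string Mark(string pattern, int offset)
    {
        // Clamp so a negative offset or one past the end still yields a valid marker.
        int column = Math.Max(0, Math.Min(offset, pattern.Length));
        return pattern + Environment.NewLine + new string(' ', column) + "^";
    }

    // Tries to construct the Regex; on a parse error, prints the error kind
    // and the marked pattern, then returns null.
    public static Regex TryCreate(string pattern)
    {
        try
        {
            return new Regex(pattern);
        }
        catch (RegexParseException e)
        {
            Console.WriteLine($"{e.Error}:");
            Console.WriteLine(Mark(pattern, e.Offset));
            return null;
        }
    }
}
```

An editor would map the same offset to a squiggle span instead of printing a caret; the clamping is there because the parser may report a position at the very end of the pattern.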
Alternative Designs
Alternatively, we may remove the type entirely and merely throw an ArgumentException. But it is likely that some people rely on the internal type even though it isn't public, since the contextual information can be reached through reflection and is probably used in regex libraries. Besides, removing it would make any future improvements in dealing with parsing errors and proposing fixes in GUIs much harder to do.
Risks
The only risk I can think of is that after exposing this exception, people would like even more details. But that's probably a good thing and only improves the existing API.
Note that:
- Existing code that checks for ArgumentException continues to work.
- While debugging, people who see the underlying exception type can now actually use it.
- Existing code using reflection to get to the extra data may or may not continue to work, depending on how strictly it searches for the type.
[danmose: made some more edits]