Conversation

stephentoub
Member

Today we only support conversion of loops to atomic loops for single-character loops (e.g. a* or [abc]*). This PR augments the logic to support arbitrary loops, enabling many more loops to become atomic.
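As a minimal sketch of what "atomic" means here (the patterns and inputs are illustrative and not taken from the PR), the snippet below compares a greedy single-character loop with its explicitly atomic form `(?>...)`. When whatever follows the loop cannot match anything the loop consumed, the two forms behave identically, which is why the conversion is safe. When it can, the atomic form refuses to give characters back and the result changes.

```csharp
using System;
using System.Text.RegularExpressions;

// 'b' can never match a character consumed by a*, so backtracking into the loop
// cannot help: the greedy and atomic forms give the same answer.
bool greedyDisjoint = Regex.IsMatch("aaab", @"^a*b$");
bool atomicDisjoint = Regex.IsMatch("aaab", @"^(?>a*)b$");

// Here the trailing 'a' overlaps with the loop. The greedy loop gives back one 'a';
// the atomic loop keeps all three, so the trailing 'a' has nothing left to match.
bool greedyOverlap = Regex.IsMatch("aaa", @"^a*a$");
bool atomicOverlap = Regex.IsMatch("aaa", @"^(?>a*)a$");

Console.WriteLine($"disjoint: greedy={greedyDisjoint}, atomic={atomicDisjoint}");
Console.WriteLine($"overlap:  greedy={greedyOverlap}, atomic={atomicOverlap}");
```

Only the first kind of loop is a candidate for automatic conversion, since the rewrite must not change which inputs match.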

Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@stephentoub
Member Author

@MihuBot regexdiff

@MihuBot
MihuBot commented Jul 23, 2025

699 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"(?<lang>[a-z]{2,8})(?:(?:\\-(?<script>[a-zA- ..." (5593 uses)
[GeneratedRegex("(?<lang>[a-z]{2,8})(?:(?:\\-(?<script>[a-zA-Z]+))?\\-(?<reg>[A-Z]+))?", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
  ///     ○ Optional (greedy).<br/>
  ///         ○ Match '-'.<br/>
  ///         ○ "script" capture group.<br/>
-   ///             ○ Match a character in the set [A-Za-z\u212A] greedily at least once.<br/>
+   ///             ○ Match a character in the set [A-Za-z\u212A] atomically at least once.<br/>
  ///     ○ Match '-'.<br/>
  ///     ○ "reg" capture group.<br/>
  ///         ○ Match a character in the set [A-Za-z\u212A] atomically at least once.<br/>
                  int capture_starting_pos = 0;
                  int charloop_capture_pos = 0;
                  int charloop_starting_pos = 0, charloop_ending_pos = 0;
-                   int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
                  int loop_iteration = 0;
                  int loop_iteration1 = 0;
                  int stackpos = 0;
                              }
                              
                              // "script" capture group.
-                               //{
+                               {
                                  pos++;
                                  slice = inputSpan.Slice(pos);
                                  int capture_starting_pos1 = pos;
                                  
-                                   // Match a character in the set [A-Za-z\u212A] greedily at least once.
-                                   //{
-                                       charloop_starting_pos1 = pos;
-                                       
+                                   // Match a character in the set [A-Za-z\u212A] atomically at least once.
+                                   {
                                      int iteration1 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndKelvinSign);
                                      if (iteration1 < 0)
                                      {
                                      
                                      slice = slice.Slice(iteration1);
                                      pos += iteration1;
-                                       
-                                       charloop_ending_pos1 = pos;
-                                       charloop_starting_pos1++;
-                                       goto CharLoopEnd1;
-                                       
-                                       CharLoopBacktrack1:
-                                       UncaptureUntil(base.runstack![--stackpos]);
-                                       Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
-                                       
-                                       if (Utilities.s_hasTimeout)
-                                       {
-                                           base.CheckTimeout();
-                                       }
-                                       
-                                       if (charloop_starting_pos1 >= charloop_ending_pos1)
-                                       {
-                                           goto LoopIterationNoMatch1;
-                                       }
-                                       pos = --charloop_ending_pos1;
-                                       slice = inputSpan.Slice(pos);
-                                       
-                                       CharLoopEnd1:
-                                       Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1, base.Crawlpos());
-                                   //}
+                                   }
                                  
                                  base.Capture(2, capture_starting_pos1, pos);
-                                   
-                                   Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos1);
-                                   goto CaptureSkipBacktrack1;
-                                   
-                                   CaptureBacktrack1:
-                                   capture_starting_pos1 = base.runstack![--stackpos];
-                                   goto CharLoopBacktrack1;
-                                   
-                                   CaptureSkipBacktrack1:;
-                               //}
+                               }
                              
                              
                              // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
                              pos = base.runstack![--stackpos];
                              UncaptureUntil(base.runstack![--stackpos]);
                              slice = inputSpan.Slice(pos);
-                               goto LoopEnd1;
-                               
-                               LoopBacktrack:
-                               if (Utilities.s_hasTimeout)
-                               {
-                                   base.CheckTimeout();
-                               }
-                               
-                               if (loop_iteration1 == 0)
-                               {
-                                   // No iterations of the loop remain to backtrack into. Fail the loop.
-                                   goto LoopIterationNoMatch;
-                               }
-                               goto CaptureBacktrack1;
                              LoopEnd1:
                              
                              Utilities.StackPush(ref base.runstack!, ref stackpos, loop_iteration1);
                              goto LoopSkipBacktrack;
                              
-                               LoopBacktrack1:
+                               LoopBacktrack:
                              loop_iteration1 = base.runstack![--stackpos];
                              if (Utilities.s_hasTimeout)
                              {
                                  base.CheckTimeout();
                              }
                              
-                               goto LoopBacktrack;
+                               goto LoopIterationNoMatch1;
                              
                              LoopSkipBacktrack:;
                          //}
                          // Match '-'.
                          if (slice.IsEmpty || slice[0] != '-')
                          {
-                               goto LoopBacktrack1;
+                               goto LoopBacktrack;
                          }
                          
                          // "reg" capture group.
                                  
                                  if (iteration2 == 0)
                                  {
-                                       goto LoopBacktrack1;
+                                       goto LoopBacktrack;
                                  }
                                  
                                  slice = slice.Slice(iteration2);
      /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
      internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
      
-       /// <summary>Pops 2 values from the backtracking stack.</summary>
-       [MethodImpl(MethodImplOptions.AggressiveInlining)]
-       internal static void StackPop(int[] stack, ref int pos, out int arg0, out int arg1)
-       {
-           arg0 = stack[--pos];
-           arg1 = stack[--pos];
-       }
-       
      /// <summary>Pushes 1 value onto the backtracking stack.</summary>
      [MethodImpl(MethodImplOptions.AggressiveInlining)]
      internal static void StackPush(ref int[] stack, ref int pos, int arg0)
          }
      }
      
-       /// <summary>Pushes 3 values onto the backtracking stack.</summary>
-       [MethodImpl(MethodImplOptions.AggressiveInlining)]
-       internal static void StackPush(ref int[] stack, ref int pos, int arg0, int arg1, int arg2)
-       {
-           // If there's space available for all 3 values, store them.
-           int[] s = stack;
-           int p = pos;
-           if ((uint)(p + 2) < (uint)s.Length)
-           {
-               s[p] = arg0;
-               s[p + 1] = arg1;
-               s[p + 2] = arg2;
-               pos += 3;
-               return;
-           }
-       
-           // Otherwise, resize the stack to make room and try again.
-           WithResize(ref stack, ref pos, arg0, arg1, arg2);
-       
-           // <summary>Resize the backtracking stack array and push 3 values onto the stack.</summary>
-           [MethodImpl(MethodImplOptions.NoInlining)]
-           static void WithResize(ref int[] stack, ref int pos, int arg0, int arg1, int arg2)
-           {
-               Array.Resize(ref stack, (pos + 2) * 2);
-               StackPush(ref stack, ref pos, arg0, arg1, arg2);
-           }
-       }
-       
      /// <summary>Supports searching for characters in or not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK".</summary>
      internal static readonly SearchValues<char> s_asciiLettersAndKelvinSign = SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");
  }
"^((([a-z]|\\d|[!#\\$%&'\\*\\+\\-\\/=\\?\\^_` ..." (4566 uses)
[GeneratedRegex("^((([a-z]|\\d|[!#\\$%&'\\*\\+\\-\\/=\\?\\^_`{\\|}~]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])+(\\.([a-z]|\\d|[!#\\$%&'\\*\\+\\-\\/=\\?\\^_`{\\|}~]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])+)*)|((\\x22)((((\\x20|\\x09)*(\\x0d\\x0a))?(\\x20|\\x09)+)?(([\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]|\\x21|[\\x23-\\x5b]|[\\x5d-\\x7e]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])|(\\\\([\\x01-\\x09\\x0b\\x0c\\x0d-\\x7f]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]))))*(((\\x20|\\x09)*(\\x0d\\x0a))?(\\x20|\\x09)+)?(\\x22)))@((([a-z]|\\d|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])|(([a-z]|\\d|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])([a-z]|\\d|-|\\.|_|~|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])*([a-z]|\\d|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])))\\.)+(([a-z]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])|(([a-z]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])([a-z]|\\d|-|\\.|_|~|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])*([a-z]|[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])))\\.?$", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture)]
  ///                 ○ Optional (greedy).<br/>
  ///                     ○ Match a character in the set [\t ] atomically any number of times.<br/>
  ///                     ○ Match the string "\r\n".<br/>
-   ///                 ○ Match a character in the set [\t ] greedily at least once.<br/>
+   ///                 ○ Match a character in the set [\t ] atomically at least once.<br/>
  ///             ○ Match with 2 alternative expressions.<br/>
  ///                 ○ Match a character in the set [\u0001-\b\v\f\u000E-\u001F!#-[]-\u007F\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF].<br/>
  ///                 ○ Match a sequence of expressions.<br/>
  ///             ○ Optional (greedy).<br/>
  ///                 ○ Match a character in the set [\t ] atomically any number of times.<br/>
  ///                 ○ Match the string "\r\n".<br/>
-   ///             ○ Match a character in the set [\t ] greedily at least once.<br/>
+   ///             ○ Match a character in the set [\t ] atomically at least once.<br/>
  ///         ○ Match '"'.<br/>
  /// ○ Match '@'.<br/>
  /// ○ Loop greedily at least once.<br/>
                  int charloop_starting_pos = 0, charloop_ending_pos = 0;
                  int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
                  int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
-                   int charloop_starting_pos3 = 0, charloop_ending_pos3 = 0;
-                   int charloop_starting_pos4 = 0, charloop_ending_pos4 = 0;
                  int loop_iteration = 0;
                  int loop_iteration1 = 0;
                  int loop_iteration2 = 0;
                                      LoopSkipBacktrack:;
                                  //}
                                  
-                                   // Match a character in the set [\t ] greedily at least once.
-                                   //{
-                                       charloop_starting_pos1 = pos;
-                                       
+                                   // Match a character in the set [\t ] atomically at least once.
+                                   {
                                      int iteration3 = slice.IndexOfAnyExcept('\t', ' ');
                                      if (iteration3 < 0)
                                      {
                                      
                                      slice = slice.Slice(iteration3);
                                      pos += iteration3;
-                                       
-                                       charloop_ending_pos1 = pos;
-                                       charloop_starting_pos1++;
-                                       goto CharLoopEnd1;
-                                       
-                                       CharLoopBacktrack1:
-                                       Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
-                                       
-                                       if (Utilities.s_hasTimeout)
-                                       {
-                                           base.CheckTimeout();
-                                       }
-                                       
-                                       if (charloop_starting_pos1 >= charloop_ending_pos1)
-                                       {
-                                           goto LoopBacktrack;
-                                       }
-                                       pos = --charloop_ending_pos1;
-                                       slice = inputSpan.Slice(pos);
-                                       
-                                       CharLoopEnd1:
-                                       Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1);
-                                   //}
+                                   }
                                  
                                  
                                  // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
                                      // No iterations of the loop remain to backtrack into. Fail the loop.
                                      goto LoopIterationNoMatch1;
                                  }
-                                   goto CharLoopBacktrack1;
+                                   goto LoopBacktrack;
                                  LoopEnd2:
                                  
                                  Utilities.StackPush(ref base.runstack!, ref stackpos, loop_iteration2);
                                  LoopSkipBacktrack2:;
                              //}
                              
-                               // Match a character in the set [\t ] greedily at least once.
-                               //{
-                                   charloop_starting_pos2 = pos;
-                                   
+                               // Match a character in the set [\t ] atomically at least once.
+                               {
                                  int iteration5 = slice.IndexOfAnyExcept('\t', ' ');
                                  if (iteration5 < 0)
                                  {
                                  
                                  slice = slice.Slice(iteration5);
                                  pos += iteration5;
-                                   
-                                   charloop_ending_pos2 = pos;
-                                   charloop_starting_pos2++;
-                                   goto CharLoopEnd2;
-                                   
-                                   CharLoopBacktrack2:
-                                   Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
-                                   
-                                   if (Utilities.s_hasTimeout)
-                                   {
-                                       base.CheckTimeout();
-                                   }
-                                   
-                                   if (charloop_starting_pos2 >= charloop_ending_pos2)
-                                   {
-                                       goto LoopBacktrack4;
-                                   }
-                                   pos = --charloop_ending_pos2;
-                                   slice = inputSpan.Slice(pos);
-                                   
-                                   CharLoopEnd2:
-                                   Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2);
-                               //}
+                               }
                              
                              
                              // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
                                  // No iterations of the loop remain to backtrack into. Fail the loop.
                                  goto LoopBacktrack3;
                              }
-                               goto CharLoopBacktrack2;
+                               goto LoopBacktrack4;
                              LoopEnd4:;
                          //}
                          
                              //{
                                  pos++;
                                  slice = inputSpan.Slice(pos);
-                                   charloop_starting_pos3 = pos;
+                                   charloop_starting_pos1 = pos;
                                  
                                  int iteration6 = 0;
                                  while ((uint)iteration6 < (uint)slice.Length && ((ch = slice[iteration6]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe䟿"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0010\u0001-/A[_`a{~\u007f \ud800豈\ufdd0ﷰ\ufff0\t")))
                                  slice = slice.Slice(iteration6);
                                  pos += iteration6;
                                  
-                                   charloop_ending_pos3 = pos;
-                                   goto CharLoopEnd3;
+                                   charloop_ending_pos1 = pos;
+                                   goto CharLoopEnd1;
                                  
-                                   CharLoopBacktrack3:
-                                   Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos3, out charloop_starting_pos3);
+                                   CharLoopBacktrack1:
+                                   Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
                                  
                                  if (Utilities.s_hasTimeout)
                                  {
                                      base.CheckTimeout();
                                  }
                                  
-                                   if (charloop_starting_pos3 >= charloop_ending_pos3)
+                                   if (charloop_starting_pos1 >= charloop_ending_pos1)
                                  {
                                      goto LoopIterationNoMatch6;
                                  }
-                                   pos = --charloop_ending_pos3;
+                                   pos = --charloop_ending_pos1;
                                  slice = inputSpan.Slice(pos);
                                  
-                                   CharLoopEnd3:
-                                   Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos3, charloop_ending_pos3);
+                                   CharLoopEnd1:
+                                   Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1);
                              //}
                              
                              // Match a character in the set [A-Za-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\d].
                              if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\n\u0001A[a{ \ud800豈\ufdd0ﷰ\ufff0\t")))
                              {
-                                   goto CharLoopBacktrack3;
+                                   goto CharLoopBacktrack1;
                              }
                              
                              Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos2);
                              case 0:
                                  goto AlternationBranch2;
                              case 1:
-                                   goto CharLoopBacktrack3;
+                                   goto CharLoopBacktrack1;
                          }
                          
                          AlternationMatch2:;
                          //{
                              pos++;
                              slice = inputSpan.Slice(pos);
-                               charloop_starting_pos4 = pos;
+                               charloop_starting_pos2 = pos;
                              
                              int iteration7 = 0;
                              while ((uint)iteration7 < (uint)slice.Length && ((ch = slice[iteration7]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe䟿"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0010\u0001-/A[_`a{~\u007f \ud800豈\ufdd0ﷰ\ufff0\t")))
                              slice = slice.Slice(iteration7);
                              pos += iteration7;
                              
-                               charloop_ending_pos4 = pos;
-                               goto CharLoopEnd4;
+                               charloop_ending_pos2 = pos;
+                               goto CharLoopEnd2;
                              
-                               CharLoopBacktrack4:
+                               CharLoopBacktrack2:
                              
                              if (Utilities.s_hasTimeout)
                              {
                                  base.CheckTimeout();
                              }
                              
-                               if (charloop_starting_pos4 >= charloop_ending_pos4)
+                               if (charloop_starting_pos2 >= charloop_ending_pos2)
                              {
                                  goto LoopBacktrack6;
                              }
-                               pos = --charloop_ending_pos4;
+                               pos = --charloop_ending_pos2;
                              slice = inputSpan.Slice(pos);
                              
-                               CharLoopEnd4:
+                               CharLoopEnd2:
                          //}
                          
                          // Match a character in the set [A-Za-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF].
                          if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetter(ch) : !RegexRunner.CharInClass((char)ch, "\0\n\0A[a{ \ud800豈\ufdd0ﷰ\ufff0")))
                          {
-                               goto CharLoopBacktrack4;
+                               goto CharLoopBacktrack2;
                          }
                          
                          alternation_branch1 = 1;
                          case 0:
                              goto AlternationBranch3;
                          case 1:
-                               goto CharLoopBacktrack4;
+                               goto CharLoopBacktrack2;
                      }
                      
                      AlternationMatch3:;
"{(?<env>env:)??\\w+(\\s+(\\?\\?)??\\s+\\w+)??}" (2282 uses)
[GeneratedRegex("{(?<env>env:)??\\w+(\\s+(\\?\\?)??\\s+\\w+)??}")]
  ///             ○ 2nd capture group.<br/>
  ///                 ○ Match the string "??".<br/>
  ///         ○ Match a whitespace character atomically at least once.<br/>
-   ///         ○ Match a word character greedily at least once.<br/>
+   ///         ○ Match a word character atomically at least once.<br/>
  /// ○ Match '}'.<br/>
  /// </code>
  /// </remarks>
                  int charloop_capture_pos = 0;
                  int charloop_starting_pos = 0, charloop_ending_pos = 0;
                  int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
-                   int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
                  int lazyloop_iteration = 0;
                  int lazyloop_iteration1 = 0;
                  int lazyloop_iteration2 = 0;
                              pos += iteration2;
                          }
                          
-                           // Match a word character greedily at least once.
-                           //{
-                               charloop_starting_pos2 = pos;
-                               
+                           // Match a word character atomically at least once.
+                           {
                              int iteration3 = 0;
                              while ((uint)iteration3 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration3]))
                              {
                              
                              slice = slice.Slice(iteration3);
                              pos += iteration3;
-                               
-                               charloop_ending_pos2 = pos;
-                               charloop_starting_pos2++;
-                               goto CharLoopEnd2;
-                               
-                               CharLoopBacktrack2:
-                               UncaptureUntil(base.runstack![--stackpos]);
-                               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
-                               
-                               if (Utilities.s_hasTimeout)
-                               {
-                                   base.CheckTimeout();
-                               }
-                               
-                               if (charloop_starting_pos2 >= charloop_ending_pos2)
-                               {
-                                   goto LazyLoopBacktrack1;
-                               }
-                               pos = --charloop_ending_pos2;
-                               slice = inputSpan.Slice(pos);
-                               
-                               CharLoopEnd2:
-                               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2, base.Crawlpos());
-                           //}
+                           }
                          
                          base.Capture(1, capture_starting_pos1, pos);
                          
                          
                          CaptureBacktrack:
                          capture_starting_pos1 = base.runstack![--stackpos];
-                           goto CharLoopBacktrack2;
+                           goto LazyLoopBacktrack1;
                          
                          CaptureSkipBacktrack:;
                      //}
"(<br>)+$" (119 uses)
[GeneratedRegex("(<br>)+$", RegexOptions.IgnoreCase | RegexOptions.Multiline)]
  /// <code>RegexOptions.IgnoreCase | RegexOptions.Multiline</code><br/>
  /// Explanation:<br/>
  /// <code>
-   /// ○ Loop greedily at least once.<br/>
+   /// ○ Loop greedily and atomically at least once.<br/>
  ///     ○ 1st capture group.<br/>
  ///         ○ Match '&lt;'.<br/>
  ///         ○ Match a character in the set [Bb].<br/>
                  int matchStart = pos;
                  int loop_iteration = 0;
                  int stackpos = 0;
+                   int startingStackpos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
-                   // Loop greedily at least once.
-                   //{
+                   // Loop greedily and atomically at least once.
+                   {
+                       startingStackpos = stackpos;
                      loop_iteration = 0;
                      
                      LoopBody:
                          return false; // The input didn't match.
                      }
                      
-                       LoopEnd:;
-                   //}
+                       LoopEnd:
+                       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+                   }
                  
                  // Match if at the end of a line.
                  if ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n')
                  {
-                       goto LoopIterationNoMatch;
+                       UncaptureUntil(0);
+                       return false; // The input didn't match.
                  }
                  
                  // The input matched.

For more diff examples, see https://gist.github.com/MihuBot/e5abb03a950ee125ee5dd26759e09907

Total bytes of base: 54706120
Total bytes of diff: 54299077
Total bytes of delta: -407043 (-0.74 % of base)
Total relative delta: -62.88
    diff is an improvement.
    relative diff is an improvement.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
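// Assumes implicit usings (System, System.IO, System.Linq, System.Net.Http) are enabled.
using System.IO.Compression;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1267.json";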
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2V7C4lA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@stephentoub marked this pull request as ready for review July 23, 2025 19:23
@stephentoub requested review from Copilot and MihaZupan and removed request for Copilot July 23, 2025 19:23
@dotnet deleted a comment from MihuBot Jul 23, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR extends the regular expression engine's auto-atomicity optimization from single-character loops (like a* or [abc]*) to support arbitrary loops (like (abc)* or (abcd*)+). The enhancement analyzes the first and last character classes of complex loop patterns to determine when they can be safely converted to atomic loops, which improves performance by eliminating unnecessary backtracking.

Key changes include:

  • Extended atomicity analysis to support general loops and lazy loops with comprehensive character class overlap detection
  • Enhanced prefix analyzer to compute both first and last character classes for regex nodes
  • Updated test coverage to verify the new atomic loop optimizations work correctly
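
As a rough illustration of the overlap idea described above, the sketch below compares a loop with and without an explicit atomic group `(?>...)`. The helper `Compare` and the sample patterns are hypothetical and are not taken from the PR or its tests; they only show when making a multi-character loop atomic leaves the result unchanged and when it does not.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: match the same loop as an ordinary (backtracking) group
// and as an explicit atomic group, followed by the same remainder.
static (bool Greedy, bool Atomic) Compare(string input, string loop, string rest)
{
    bool greedy = Regex.IsMatch(input, $"^(?:{loop}){rest}");
    bool atomic = Regex.IsMatch(input, $"^(?>{loop}){rest}");
    return (greedy, atomic);
}

// 'd' cannot begin another "abc" iteration and giving back an iteration cannot
// help 'd' match, so the atomic and backtracking forms agree on these inputs.
Console.WriteLine(Compare("abcabcd", "(?:abc)*", "d"));
Console.WriteLine(Compare("abcabc", "(?:abc)*", "d"));

// Here the character after the loop ('a') can also start a loop iteration.
// The backtracking form gives back the last iteration so the trailing 'a'
// can match; the atomic form cannot, so the two disagree on "aa".
Console.WriteLine(Compare("aa", "(?:ab?)*", "a"));
```

The last case is the kind of overlap an auto-atomicity analysis has to rule out before it rewrites a loop.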

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| RegexReductionTests.cs | Added test cases for general loop atomicity; includes commented-out debug test code |
| Regex.Match.Tests.cs | Added functional tests for nested loop patterns with various lazy/greedy combinations |
| RegexPrefixAnalyzer.cs | Enhanced to support finding last character classes and refactored existing first character class logic |
| RegexNode.cs | Significantly expanded loop atomicity logic to handle arbitrary loops with character class overlap analysis |

Comments suppressed due to low confidence (2)

src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs:234

  • The test case for lazy loop '(abc)*?' expects an empty string result, but this seems unclear. Consider adding a comment explaining why this specific pattern reduces to empty, or verify this is the intended behavior.
        [InlineData("(abc)*?", "")]

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs:965

  • The goto statement on line 1967 ('goto case RegexNodeKind.Loop;') creates unclear control flow. Consider refactoring this logic into a separate method to improve readability and maintainability.
                    break;
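
On the first suppressed comment above: one plausible reading, not confirmed anywhere in this thread, is that a lazy loop with a minimum of zero and nothing after it never needs to consume input, so in terms of matched text it behaves like the empty pattern. A minimal sketch using only the public `Regex` API (the input string is arbitrary):

```csharp
using System;
using System.Text.RegularExpressions;

// A lazy loop with nothing after it stops as early as it can,
// which for a minimum of zero iterations means matching nothing.
Match lazy = Regex.Match("abcabc", "(abc)*?");

// The empty pattern, for comparison.
Match empty = Regex.Match("abcabc", "");

Console.WriteLine($"lazy:  success={lazy.Success}, index={lazy.Index}, length={lazy.Length}");
Console.WriteLine($"empty: success={empty.Success}, index={empty.Index}, length={empty.Length}");
```

This only compares the text matched; it says nothing about how the test compares the reduced node trees.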

@stephentoub
Member Author

@MihuBot regexdiff

@MihuBot

MihuBot commented Jul 23, 2025

1464 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"^\\s*(((?<ORIGIN>(((\\d+>)?[a-zA-Z]?:[^:]*)| ..." (7826 uses)
[GeneratedRegex("^\\s*(((?<ORIGIN>(((\\d+>)?[a-zA-Z]?:[^:]*)|([^:]*))):)|())(?<SUBCATEGORY>(()|([^:]*? )))(?<CATEGORY>(error|warning))( \\s*(?<CODE>[^: ]*))?\\s*:(?<TEXT>.*)$", RegexOptions.IgnoreCase)]
  int loop_iteration = 0;
  int loop_iteration1 = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match if at the beginning of the string.
                              // Branch 0
                              //{
                                  // 4th capture group.
-                                   //{
+                                   {
                                      capture_starting_pos4 = pos;
                                      
                                      // Optional (greedy).
-                                       //{
+                                       {
+                                           startingStackpos = stackpos;
                                          loop_iteration = 0;
                                          
                                          LoopBody:
                                          pos = base.runstack![--stackpos];
                                          UncaptureUntil(base.runstack![--stackpos]);
                                          slice = inputSpan.Slice(pos);
-                                           LoopEnd:;
-                                       //}
+                                           LoopEnd:
+                                           stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+                                       }
                                      
                                      // Match a character in the set [A-Za-z\u212A] atomically, optionally.
                                      {
                                      // Match ':'.
                                      if (slice.IsEmpty || slice[0] != ':')
                                      {
-                                           goto LoopIterationNoMatch;
+                                           goto AlternationBranch1;
                                      }
                                      
                                      // Match a character other than ':' atomically any number of times.
                                      pos++;
                                      slice = inputSpan.Slice(pos);
                                      base.Capture(4, capture_starting_pos4, pos);
-                                       
-                                       goto CaptureSkipBacktrack;
-                                       
-                                       CaptureBacktrack:
-                                       goto LoopIterationNoMatch;
-                                       
-                                       CaptureSkipBacktrack:;
-                                   //}
+                                   }
                                  
                                  alternation_branch1 = 0;
                                  goto AlternationMatch1;
                              switch (alternation_branch1)
                              {
                                  case 0:
-                                       goto CaptureBacktrack;
+                                       goto AlternationBranch1;
                                  case 1:
                                      goto AlternationBranch;
                              }
                          
                          base.Capture(3, capture_starting_pos3, pos);
                          
-                           goto CaptureSkipBacktrack1;
+                           goto CaptureSkipBacktrack;
                          
-                           CaptureBacktrack1:
+                           CaptureBacktrack:
                          goto AlternationBacktrack1;
                          
-                           CaptureSkipBacktrack1:;
+                           CaptureSkipBacktrack:;
                      //}
                      
                      base.Capture(13, capture_starting_pos2, pos);
                      
-                       goto CaptureSkipBacktrack2;
+                       goto CaptureSkipBacktrack1;
                      
-                       CaptureBacktrack2:
-                       goto CaptureBacktrack1;
+                       CaptureBacktrack1:
+                       goto CaptureBacktrack;
                      
-                       CaptureSkipBacktrack2:;
+                       CaptureSkipBacktrack1:;
                  //}
                  
                  // Match ':'.
                  if (slice.IsEmpty || slice[0] != ':')
                  {
-                       goto CaptureBacktrack2;
+                       goto CaptureBacktrack1;
                  }
                  
                  pos++;
                  slice = inputSpan.Slice(pos);
                  base.Capture(2, capture_starting_pos1, pos);
                  
-                   goto CaptureSkipBacktrack3;
+                   goto CaptureSkipBacktrack2;
                  
-                   CaptureBacktrack3:
-                   goto CaptureBacktrack2;
+                   CaptureBacktrack2:
+                   goto CaptureBacktrack1;
                  
-                   CaptureSkipBacktrack3:;
+                   CaptureSkipBacktrack2:;
              //}
              
              alternation_branch = 0;
          switch (alternation_branch)
          {
              case 0:
-                   goto CaptureBacktrack3;
+                   goto CaptureBacktrack2;
              case 1:
                  goto CharLoopBacktrack;
          }
      
      base.Capture(1, capture_starting_pos, pos);
      
-       goto CaptureSkipBacktrack4;
+       goto CaptureSkipBacktrack3;
      
-       CaptureBacktrack4:
+       CaptureBacktrack3:
      goto AlternationBacktrack;
      
-       CaptureSkipBacktrack4:;
+       CaptureSkipBacktrack3:;
  //}
  
  // "SUBCATEGORY" capture group.
                          slice = inputSpan.Slice(pos);
                          if (slice.IsEmpty || slice[0] == ':')
                          {
-                               goto CaptureBacktrack4;
+                               goto CaptureBacktrack3;
                          }
                          pos++;
                          slice = inputSpan.Slice(pos);
                          lazyloop_pos = slice.IndexOfAny(':', ' ');
                          if ((uint)lazyloop_pos >= (uint)slice.Length || slice[lazyloop_pos] == ':')
                          {
-                               goto CaptureBacktrack4;
+                               goto CaptureBacktrack3;
                          }
                          pos += lazyloop_pos;
                          slice = inputSpan.Slice(pos);
                      slice = inputSpan.Slice(pos);
                      base.Capture(10, capture_starting_pos11, pos);
                      
-                       goto CaptureSkipBacktrack5;
+                       goto CaptureSkipBacktrack4;
                      
-                       CaptureBacktrack5:
+                       CaptureBacktrack4:
                      goto LazyLoopBacktrack;
                      
-                       CaptureSkipBacktrack5:;
+                       CaptureSkipBacktrack4:;
                  //}
                  
                  alternation_branch2 = 1;
                  case 0:
                      goto AlternationBranch2;
                  case 1:
-                       goto CaptureBacktrack5;
+                       goto CaptureBacktrack4;
              }
              
              AlternationMatch2:;
          
          base.Capture(8, capture_starting_pos9, pos);
          
-           goto CaptureSkipBacktrack6;
+           goto CaptureSkipBacktrack5;
          
-           CaptureBacktrack6:
+           CaptureBacktrack5:
          goto AlternationBacktrack2;
          
-           CaptureSkipBacktrack6:;
+           CaptureSkipBacktrack5:;
      //}
      
      base.Capture(14, capture_starting_pos8, pos);
      
-       goto CaptureSkipBacktrack7;
+       goto CaptureSkipBacktrack6;
      
-       CaptureBacktrack7:
-       goto CaptureBacktrack6;
+       CaptureBacktrack6:
+       goto CaptureBacktrack5;
      
-       CaptureSkipBacktrack7:;
+       CaptureSkipBacktrack6:;
  //}
  
  // "CATEGORY" capture group.
          //{
              if (slice.IsEmpty)
              {
-                   goto CaptureBacktrack7;
+                   goto CaptureBacktrack6;
              }
              
              switch (slice[0])
                      if ((uint)slice.Length < 5 ||
                          !slice.Slice(1).StartsWith("rror", StringComparison.OrdinalIgnoreCase)) // Match the string "rror" (ordinal case-insensitive)
                      {
-                           goto CaptureBacktrack7;
+                           goto CaptureBacktrack6;
                      }
                      
                      pos += 5;
                      if ((uint)slice.Length < 7 ||
                          !slice.Slice(1).StartsWith("arning", StringComparison.OrdinalIgnoreCase)) // Match the string "arning" (ordinal case-insensitive)
                      {
-                           goto CaptureBacktrack7;
+                           goto CaptureBacktrack6;
                      }
                      
                      pos += 7;
                      break;
                      
                  default:
-                       goto CaptureBacktrack7;
+                       goto CaptureBacktrack6;
              }
          //}
          
              base.Capture(16, capture_starting_pos15, pos);
              
              Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos15);
-               goto CaptureSkipBacktrack8;
+               goto CaptureSkipBacktrack7;
              
-               CaptureBacktrack8:
+               CaptureBacktrack7:
              capture_starting_pos15 = base.runstack![--stackpos];
              goto CharLoopBacktrack2;
              
-               CaptureSkipBacktrack8:;
+               CaptureSkipBacktrack7:;
          //}
          
          base.Capture(12, capture_starting_pos14, pos);
          
          Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos14);
-           goto CaptureSkipBacktrack9;
+           goto CaptureSkipBacktrack8;
          
-           CaptureBacktrack9:
+           CaptureBacktrack8:
          capture_starting_pos14 = base.runstack![--stackpos];
-           goto CaptureBacktrack8;
+           goto CaptureBacktrack7;
          
-           CaptureSkipBacktrack9:;
+           CaptureSkipBacktrack8:;
      //}
      
      
      if (--loop_iteration1 < 0)
      {
          // Unable to match the remainder of the expression after exhausting the loop.
-           goto CaptureBacktrack7;
+           goto CaptureBacktrack6;
      }
      pos = base.runstack![--stackpos];
      UncaptureUntil(base.runstack![--stackpos]);
      if (loop_iteration1 == 0)
      {
          // No iterations of the loop remain to backtrack into. Fail the loop.
-           goto CaptureBacktrack7;
+           goto CaptureBacktrack6;
      }
-       goto CaptureBacktrack9;
+       goto CaptureBacktrack8;
      LoopEnd1:;
  //}
  
      
      base.Capture(17, capture_starting_pos16, pos);
      
-       goto CaptureSkipBacktrack10;
+       goto CaptureSkipBacktrack9;
      
-       CaptureBacktrack10:
+       CaptureBacktrack9:
      goto CharLoopBacktrack3;
      
-       CaptureSkipBacktrack10:;
+       CaptureSkipBacktrack9:;
  //}
  
  // Match if at the end of the string or if before an ending newline.
  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
  {
-       goto CaptureBacktrack10;
+       goto CaptureBacktrack9;
  }
  
  // The input matched.
"\\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z ..." (5703 uses)
[GeneratedRegex("\\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)\\Z", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
  int loop_iteration2 = 0;
  int loop_iteration3 = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
+   int startingStackpos1 = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match if at the beginning of the string.
          goto LoopIterationNoMatch1;
      }
      
-       // Optional (greedy).
-       //{
-           pos++;
-           slice = inputSpan.Slice(pos);
-           loop_iteration2 = 0;
+       // Atomic group.
+       {
+           int atomic_stackpos = stackpos;
          
-           LoopBody2:
-           Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
-           
-           loop_iteration2++;
-           
-           // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+           // Optional (greedy).
          //{
-               charloop_starting_pos1 = pos;
+               pos++;
+               slice = inputSpan.Slice(pos);
+               startingStackpos = stackpos;
+               loop_iteration2 = 0;
              
-               int iteration2 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
-               if (iteration2 < 0)
+               LoopBody2:
+               Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
+               
+               loop_iteration2++;
+               
+               // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+               //{
+                   charloop_starting_pos1 = pos;
+                   
+                   int iteration2 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
+                   if (iteration2 < 0)
+                   {
+                       iteration2 = slice.Length;
+                   }
+                   
+                   slice = slice.Slice(iteration2);
+                   pos += iteration2;
+                   
+                   charloop_ending_pos1 = pos;
+                   goto CharLoopEnd1;
+                   
+                   CharLoopBacktrack1:
+                   Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
+                   
+                   if (Utilities.s_hasTimeout)
+                   {
+                       base.CheckTimeout();
+                   }
+                   
+                   if (charloop_starting_pos1 >= charloop_ending_pos1 ||
+                       (charloop_ending_pos1 = inputSpan.Slice(charloop_starting_pos1, charloop_ending_pos1 - charloop_starting_pos1).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
+                   {
+                       goto LoopIterationNoMatch2;
+                   }
+                   charloop_ending_pos1 += charloop_starting_pos1;
+                   pos = charloop_ending_pos1;
+                   slice = inputSpan.Slice(pos);
+                   
+                   CharLoopEnd1:
+                   Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1);
+               //}
+               
+               // Match a character in the set [0-9A-Za-z\u212A].
+               if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
              {
-                   iteration2 = slice.Length;
+                   goto CharLoopBacktrack1;
              }
              
-               slice = slice.Slice(iteration2);
-               pos += iteration2;
-               
-               charloop_ending_pos1 = pos;
-               goto CharLoopEnd1;
-               
-               CharLoopBacktrack1:
-               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
-               
-               if (Utilities.s_hasTimeout)
-               {
-                   base.CheckTimeout();
-               }
-               
-               if (charloop_starting_pos1 >= charloop_ending_pos1 ||
-                   (charloop_ending_pos1 = inputSpan.Slice(charloop_starting_pos1, charloop_ending_pos1 - charloop_starting_pos1).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
-               {
-                   goto LoopIterationNoMatch2;
-               }
-               charloop_ending_pos1 += charloop_starting_pos1;
-               pos = charloop_ending_pos1;
+               pos++;
              slice = inputSpan.Slice(pos);
              
-               CharLoopEnd1:
-               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1);
+               // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
+               if (loop_iteration2 == 0)
+               {
+                   goto LoopBody2;
+               }
+               goto LoopEnd2;
+               
+               // The loop iteration failed. Put state back to the way it was before the iteration.
+               LoopIterationNoMatch2:
+               if (--loop_iteration2 < 0)
+               {
+                   // Unable to match the remainder of the expression after exhausting the loop.
+                   goto LoopIterationNoMatch1;
+               }
+               pos = base.runstack![--stackpos];
+               slice = inputSpan.Slice(pos);
+               LoopEnd2:
+               stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
          //}
          
-           // Match a character in the set [0-9A-Za-z\u212A].
-           if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
-           {
-               goto CharLoopBacktrack1;
-           }
-           
-           pos++;
-           slice = inputSpan.Slice(pos);
-           
-           // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
-           if (loop_iteration2 == 0)
-           {
-               goto LoopBody2;
-           }
-           goto LoopEnd2;
-           
-           // The loop iteration failed. Put state back to the way it was before the iteration.
-           LoopIterationNoMatch2:
-           if (--loop_iteration2 < 0)
-           {
-               // Unable to match the remainder of the expression after exhausting the loop.
-               goto LoopIterationNoMatch1;
-           }
-           pos = base.runstack![--stackpos];
-           slice = inputSpan.Slice(pos);
-           goto LoopEnd2;
-           
-           LoopBacktrack:
-           if (Utilities.s_hasTimeout)
-           {
-               base.CheckTimeout();
-           }
-           
-           if (loop_iteration2 == 0)
-           {
-               // No iterations of the loop remain to backtrack into. Fail the loop.
-               goto LoopIterationNoMatch1;
-           }
-           goto CharLoopBacktrack1;
-           LoopEnd2:
-           
-           Utilities.StackPush(ref base.runstack!, ref stackpos, loop_iteration2);
-           goto LoopSkipBacktrack;
-           
-           LoopBacktrack1:
-           loop_iteration2 = base.runstack![--stackpos];
-           if (Utilities.s_hasTimeout)
-           {
-               base.CheckTimeout();
-           }
-           
-           goto LoopBacktrack;
-           
-           LoopSkipBacktrack:;
-       //}
+           stackpos = atomic_stackpos;
+       }
      
      // Match '.'.
      if (slice.IsEmpty || slice[0] != '.')
      {
-           goto LoopBacktrack1;
+           goto LoopIterationNoMatch1;
      }
      
      pos++;
      slice = inputSpan.Slice(pos);
      if (loop_iteration1 == 0)
      {
-           // No iterations have been matched to backtrack into. Fail the loop.
+           // All possible iterations have matched, but it's below the required minimum of 1. Fail the loop.
          goto LoopIterationNoMatch;
      }
      
-       goto LoopEnd1;
-       
-       LoopBacktrack2:
-       if (Utilities.s_hasTimeout)
-       {
-           base.CheckTimeout();
-       }
-       
-       if (loop_iteration1 == 0)
-       {
-           // No iterations of the loop remain to backtrack into. Fail the loop.
-           goto LoopIterationNoMatch;
-       }
-       goto LoopBacktrack1;
      LoopEnd1:;
  //}
  
  // Match a character in the set [0-9A-Za-z\u212A].
  if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
  {
-       goto LoopBacktrack2;
+       goto LoopIterationNoMatch1;
  }
  
-   // Optional (greedy).
-   //{
-       pos++;
-       slice = inputSpan.Slice(pos);
-       loop_iteration3 = 0;
+   // Atomic group.
+   {
+       int atomic_stackpos1 = stackpos;
      
-       LoopBody3:
-       Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
-       
-       loop_iteration3++;
-       
-       // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+       // Optional (greedy).
      //{
-           charloop_starting_pos2 = pos;
+           pos++;
+           slice = inputSpan.Slice(pos);
+           startingStackpos1 = stackpos;
+           loop_iteration3 = 0;
          
-           int iteration3 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
-           if (iteration3 < 0)
+           LoopBody3:
+           Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
+           
+           loop_iteration3++;
+           
+           // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+           //{
+               charloop_starting_pos2 = pos;
+               
+               int iteration3 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
+               if (iteration3 < 0)
+               {
+                   iteration3 = slice.Length;
+               }
+               
+               slice = slice.Slice(iteration3);
+               pos += iteration3;
+               
+               charloop_ending_pos2 = pos;
+               goto CharLoopEnd2;
+               
+               CharLoopBacktrack2:
+               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
+               
+               if (Utilities.s_hasTimeout)
+               {
+                   base.CheckTimeout();
+               }
+               
+               if (charloop_starting_pos2 >= charloop_ending_pos2 ||
+                   (charloop_ending_pos2 = inputSpan.Slice(charloop_starting_pos2, charloop_ending_pos2 - charloop_starting_pos2).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
+               {
+                   goto LoopIterationNoMatch3;
+               }
+               charloop_ending_pos2 += charloop_starting_pos2;
+               pos = charloop_ending_pos2;
+               slice = inputSpan.Slice(pos);
+               
+               CharLoopEnd2:
+               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2);
+           //}
+           
+           // Match a character in the set [0-9A-Za-z\u212A].
+           if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
          {
-               iteration3 = slice.Length;
+               goto CharLoopBacktrack2;
          }
          
-           slice = slice.Slice(iteration3);
-           pos += iteration3;
-           
-           charloop_ending_pos2 = pos;
-           goto CharLoopEnd2;
-           
-           CharLoopBacktrack2:
-           Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
-           
-           if (Utilities.s_hasTimeout)
-           {
-               base.CheckTimeout();
-           }
-           
-           if (charloop_starting_pos2 >= charloop_ending_pos2 ||
-               (charloop_ending_pos2 = inputSpan.Slice(charloop_starting_pos2, charloop_ending_pos2 - charloop_starting_pos2).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
-           {
-               goto LoopIterationNoMatch3;
-           }
-           charloop_ending_pos2 += charloop_starting_pos2;
-           pos = charloop_ending_pos2;
+           pos++;
          slice = inputSpan.Slice(pos);
          
-           CharLoopEnd2:
-           Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2);
+           // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
+           if (loop_iteration3 == 0)
+           {
+               goto LoopBody3;
+           }
+           goto LoopEnd3;
+           
+           // The loop iteration failed. Put state back to the way it was before the iteration.
+           LoopIterationNoMatch3:
+           if (--loop_iteration3 < 0)
+           {
+               // Unable to match the remainder of the expression after exhausting the loop.
+               goto LoopIterationNoMatch1;
+           }
+           pos = base.runstack![--stackpos];
+           slice = inputSpan.Slice(pos);
+           LoopEnd3:
+           stackpos = startingStackpos1; // Ensure any remaining backtracking state is removed.
      //}
      
-       // Match a character in the set [0-9A-Za-z\u212A].
-       if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
-       {
-           goto CharLoopBacktrack2;
-       }
-       
-       pos++;
-       slice = inputSpan.Slice(pos);
-       
-       // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
-       if (loop_iteration3 == 0)
-       {
-           goto LoopBody3;
-       }
-       goto LoopEnd3;
-       
-       // The loop iteration failed. Put state back to the way it was before the iteration.
-       LoopIterationNoMatch3:
-       if (--loop_iteration3 < 0)
-       {
-           // Unable to match the remainder of the expression after exhausting the loop.
-           goto LoopBacktrack2;
-       }
-       pos = base.runstack![--stackpos];
-       slice = inputSpan.Slice(pos);
-       goto LoopEnd3;
-       
-       LoopBacktrack3:
-       if (Utilities.s_hasTimeout)
-       {
-           base.CheckTimeout();
-       }
-       
-       if (loop_iteration3 == 0)
-       {
-           // No iterations of the loop remain to backtrack into. Fail the loop.
-           goto LoopBacktrack2;
-       }
-       goto CharLoopBacktrack2;
-       LoopEnd3:;
-   //}
+       stackpos = atomic_stackpos1;
+   }
  
  // Match if at the end of the string or if before an ending newline.
  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
  {
-       goto LoopBacktrack3;
+       goto LoopIterationNoMatch1;
  }
  
  // The input matched.
"^!([0-9A-Za-z_\\-]*!)?$" (3395 uses)
[GeneratedRegex("^!([0-9A-Za-z_\\-]*!)?$")]
  int matchStart = pos;
  int loop_iteration = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match if at the beginning of the string.
  }
  
  // Optional (greedy).
-   //{
+   {
      pos++;
      slice = inputSpan.Slice(pos);
+       startingStackpos = stackpos;
      loop_iteration = 0;
      
      LoopBody:
      pos = base.runstack![--stackpos];
      UncaptureUntil(base.runstack![--stackpos]);
      slice = inputSpan.Slice(pos);
-       LoopEnd:;
-   //}
+       LoopEnd:
+       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+   }
  
  // Match if at the end of the string or if before an ending newline.
  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
  {
-       goto LoopIterationNoMatch;
+       UncaptureUntil(0);
+       return false; // The input didn't match.
  }
  
  // The input matched.
"AssemblyFileVersion(Attribute)?\\s*\\(.*\\)\\s*" (2417 uses)
[GeneratedRegex("AssemblyFileVersion(Attribute)?\\s*\\(.*\\)\\s*")]
  int charloop_starting_pos = 0, charloop_ending_pos = 0;
  int loop_iteration = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match the string "AssemblyFileVersion".
  }
  
  // Optional (greedy).
-   //{
+   {
      pos += 19;
      slice = inputSpan.Slice(pos);
+       startingStackpos = stackpos;
      loop_iteration = 0;
      
      LoopBody:
      pos = base.runstack![--stackpos];
      UncaptureUntil(base.runstack![--stackpos]);
      slice = inputSpan.Slice(pos);
-       LoopEnd:;
-   //}
+       LoopEnd:
+       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+   }
  
  // Match a whitespace character atomically any number of times.
  {
  // Match '('.
  if (slice.IsEmpty || slice[0] != '(')
  {
-       goto LoopIterationNoMatch;
+       UncaptureUntil(0);
+       return false; // The input didn't match.
  }
  
  // Match a character other than '\n' greedily any number of times.
      if (charloop_starting_pos >= charloop_ending_pos ||
          (charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOf(')')) < 0)
      {
-           goto LoopIterationNoMatch;
+           UncaptureUntil(0);
+           return false; // The input didn't match.
      }
      charloop_ending_pos += charloop_starting_pos;
      pos = charloop_ending_pos;

For more diff examples, see https://gist.github.com/MihuBot/890f4991a1798fc6e523c034b42d8950
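
As a rough illustration of what the `^!([0-9A-Za-z_\-]*!)?$` diff above relies on, the sketch below compares that pattern against a hand-written variant whose optional group is wrapped in an explicit atomic group `(?>...)`. The atomic variant and the input list are assumptions made for this example and are not taken from the PR; the point is only that, for inputs like these, discarding the loop's backtracking state should not change whether the pattern matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the diff above.
string original = @"^!([0-9A-Za-z_\-]*!)?$";
// Hand-written variant (an assumption for illustration): the optional group
// is wrapped in an atomic group so it can never be backtracked into.
string atomic = @"^!(?>([0-9A-Za-z_\-]*!)?)$";

// Hypothetical sample inputs.
string[] inputs = { "!", "!abc!", "!abc", "!a-b_c!", "!!", "!abc!x", "" };

foreach (string input in inputs)
{
    bool originalResult = Regex.IsMatch(input, original);
    bool atomicResult = Regex.IsMatch(input, atomic);
    Console.WriteLine($"\"{input}\": original={originalResult}, atomic={atomicResult}");

    // Fail loudly if the two patterns ever disagree on these inputs.
    if (originalResult != atomicResult)
    {
        throw new InvalidOperationException($"Mismatch for \"{input}\"");
    }
}
```

The character class inside the loop cannot match `!`, and `$` cannot match anything the optional group would have to give back, which is why making the loop atomic looks safe here.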

Total bytes of base: 54706120
Total bytes of diff: 54274946
Total bytes of delta: -431174 (-0.79 % of base)
Total relative delta: -56.87
    diff is an improvement.
    relative diff is an improvement.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
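using System.IO.Compression;          // ZipArchive, ZipArchiveMode, ExtractToFile
using System.Text.Json;               // JsonSerializer, JsonSerializerOptions
using System.Text.RegularExpressions; // RegexOptions
// System, System.IO, System.Linq and System.Net.Http are assumed to come from the SDK's implicit usings.

const string JsonPath = "RegexResults-1268.json";
if (!File.Exists(JsonPath))
{
    // Download the results archive once and cache the extracted JSON locally.
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2WocUI");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

// IncludeFields is needed so the tuple-typed members of RegexEntry deserialize.
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");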



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}
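
One way the loaded entries might be explored is sketched below. To stay self-contained it uses a small in-memory array of tuples instead of the downloaded `RegexEntry[]`; the source strings are hypothetical placeholders, and treating "changed" as `MainSource != PrSource` is an assumption about how the bot counts changes, not something the report states.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in data shaped like a subset of RegexEntry:
// the pattern, its usage count, and the generated source before/after the PR.
var sample = new (string Pattern, int Count, string MainSource, string PrSource)[]
{
    (@"AssemblyFileVersion(Attribute)?\s*\(.*\)\s*", 2417, "old source A", "new source A"),
    (@"^!([0-9A-Za-z_\-]*!)?$", 3395, "old source B", "new source B"),
    (@"abc", 1, "same source", "same source"),
};

// Assumption: a pattern counts as changed when its generated source differs.
var changed = sample
    .Where(e => e.MainSource != e.PrSource)
    .OrderByDescending(e => e.Count) // most widely used patterns first
    .ToArray();

Console.WriteLine($"{changed.Length} out of {sample.Length} patterns have generated source code changes.");
foreach (var entry in changed)
{
    Console.WriteLine($"{entry.Pattern} ({entry.Count} uses)");
}
```

With the real data, the same query could be run over `entries` using `e.Regex.Pattern` and `e.Regex.Count` in place of the tuple members.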

@stephentoub
Member Author

@MihuBot regexdiff

@MihuBot

MihuBot commented Jul 23, 2025

1376 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"^\\s*(((?<ORIGIN>(((\\d+>)?[a-zA-Z]?:[^:]*)| ..." (7826 uses)
[GeneratedRegex("^\\s*(((?<ORIGIN>(((\\d+>)?[a-zA-Z]?:[^:]*)|([^:]*))):)|())(?<SUBCATEGORY>(()|([^:]*? )))(?<CATEGORY>(error|warning))( \\s*(?<CODE>[^: ]*))?\\s*:(?<TEXT>.*)$", RegexOptions.IgnoreCase)]
  int loop_iteration = 0;
  int loop_iteration1 = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match if at the beginning of the string.
                              // Branch 0
                              //{
                                  // 4th capture group.
-                                   //{
+                                   {
                                      capture_starting_pos4 = pos;
                                      
                                      // Optional (greedy).
-                                       //{
+                                       {
+                                           startingStackpos = stackpos;
                                          loop_iteration = 0;
                                          
                                          LoopBody:
                                          pos = base.runstack![--stackpos];
                                          UncaptureUntil(base.runstack![--stackpos]);
                                          slice = inputSpan.Slice(pos);
-                                           LoopEnd:;
-                                       //}
+                                           LoopEnd:
+                                           stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+                                       }
                                      
                                      // Match a character in the set [A-Za-z\u212A] atomically, optionally.
                                      {
                                      // Match ':'.
                                      if (slice.IsEmpty || slice[0] != ':')
                                      {
-                                           goto LoopIterationNoMatch;
+                                           goto AlternationBranch1;
                                      }
                                      
                                      // Match a character other than ':' atomically any number of times.
                                      pos++;
                                      slice = inputSpan.Slice(pos);
                                      base.Capture(4, capture_starting_pos4, pos);
-                                       
-                                       goto CaptureSkipBacktrack;
-                                       
-                                       CaptureBacktrack:
-                                       goto LoopIterationNoMatch;
-                                       
-                                       CaptureSkipBacktrack:;
-                                   //}
+                                   }
                                  
                                  alternation_branch1 = 0;
                                  goto AlternationMatch1;
                              switch (alternation_branch1)
                              {
                                  case 0:
-                                       goto CaptureBacktrack;
+                                       goto AlternationBranch1;
                                  case 1:
                                      goto AlternationBranch;
                              }
                          
                          base.Capture(3, capture_starting_pos3, pos);
                          
-                           goto CaptureSkipBacktrack1;
+                           goto CaptureSkipBacktrack;
                          
-                           CaptureBacktrack1:
+                           CaptureBacktrack:
                          goto AlternationBacktrack1;
                          
-                           CaptureSkipBacktrack1:;
+                           CaptureSkipBacktrack:;
                      //}
                      
                      base.Capture(13, capture_starting_pos2, pos);
                      
-                       goto CaptureSkipBacktrack2;
+                       goto CaptureSkipBacktrack1;
                      
-                       CaptureBacktrack2:
-                       goto CaptureBacktrack1;
+                       CaptureBacktrack1:
+                       goto CaptureBacktrack;
                      
-                       CaptureSkipBacktrack2:;
+                       CaptureSkipBacktrack1:;
                  //}
                  
                  // Match ':'.
                  if (slice.IsEmpty || slice[0] != ':')
                  {
-                       goto CaptureBacktrack2;
+                       goto CaptureBacktrack1;
                  }
                  
                  pos++;
                  slice = inputSpan.Slice(pos);
                  base.Capture(2, capture_starting_pos1, pos);
                  
-                   goto CaptureSkipBacktrack3;
+                   goto CaptureSkipBacktrack2;
                  
-                   CaptureBacktrack3:
-                   goto CaptureBacktrack2;
+                   CaptureBacktrack2:
+                   goto CaptureBacktrack1;
                  
-                   CaptureSkipBacktrack3:;
+                   CaptureSkipBacktrack2:;
              //}
              
              alternation_branch = 0;
          switch (alternation_branch)
          {
              case 0:
-                   goto CaptureBacktrack3;
+                   goto CaptureBacktrack2;
              case 1:
                  goto CharLoopBacktrack;
          }
      
      base.Capture(1, capture_starting_pos, pos);
      
-       goto CaptureSkipBacktrack4;
+       goto CaptureSkipBacktrack3;
      
-       CaptureBacktrack4:
+       CaptureBacktrack3:
      goto AlternationBacktrack;
      
-       CaptureSkipBacktrack4:;
+       CaptureSkipBacktrack3:;
  //}
  
  // "SUBCATEGORY" capture group.
                          slice = inputSpan.Slice(pos);
                          if (slice.IsEmpty || slice[0] == ':')
                          {
-                               goto CaptureBacktrack4;
+                               goto CaptureBacktrack3;
                          }
                          pos++;
                          slice = inputSpan.Slice(pos);
                          lazyloop_pos = slice.IndexOfAny(':', ' ');
                          if ((uint)lazyloop_pos >= (uint)slice.Length || slice[lazyloop_pos] == ':')
                          {
-                               goto CaptureBacktrack4;
+                               goto CaptureBacktrack3;
                          }
                          pos += lazyloop_pos;
                          slice = inputSpan.Slice(pos);
                      slice = inputSpan.Slice(pos);
                      base.Capture(10, capture_starting_pos11, pos);
                      
-                       goto CaptureSkipBacktrack5;
+                       goto CaptureSkipBacktrack4;
                      
-                       CaptureBacktrack5:
+                       CaptureBacktrack4:
                      goto LazyLoopBacktrack;
                      
-                       CaptureSkipBacktrack5:;
+                       CaptureSkipBacktrack4:;
                  //}
                  
                  alternation_branch2 = 1;
                  case 0:
                      goto AlternationBranch2;
                  case 1:
-                       goto CaptureBacktrack5;
+                       goto CaptureBacktrack4;
              }
              
              AlternationMatch2:;
          
          base.Capture(8, capture_starting_pos9, pos);
          
-           goto CaptureSkipBacktrack6;
+           goto CaptureSkipBacktrack5;
          
-           CaptureBacktrack6:
+           CaptureBacktrack5:
          goto AlternationBacktrack2;
          
-           CaptureSkipBacktrack6:;
+           CaptureSkipBacktrack5:;
      //}
      
      base.Capture(14, capture_starting_pos8, pos);
      
-       goto CaptureSkipBacktrack7;
+       goto CaptureSkipBacktrack6;
      
-       CaptureBacktrack7:
-       goto CaptureBacktrack6;
+       CaptureBacktrack6:
+       goto CaptureBacktrack5;
      
-       CaptureSkipBacktrack7:;
+       CaptureSkipBacktrack6:;
  //}
  
  // "CATEGORY" capture group.
          //{
              if (slice.IsEmpty)
              {
-                   goto CaptureBacktrack7;
+                   goto CaptureBacktrack6;
              }
              
              switch (slice[0])
                      if ((uint)slice.Length < 5 ||
                          !slice.Slice(1).StartsWith("rror", StringComparison.OrdinalIgnoreCase)) // Match the string "rror" (ordinal case-insensitive)
                      {
-                           goto CaptureBacktrack7;
+                           goto CaptureBacktrack6;
                      }
                      
                      pos += 5;
                      if ((uint)slice.Length < 7 ||
                          !slice.Slice(1).StartsWith("arning", StringComparison.OrdinalIgnoreCase)) // Match the string "arning" (ordinal case-insensitive)
                      {
-                           goto CaptureBacktrack7;
+                           goto CaptureBacktrack6;
                      }
                      
                      pos += 7;
                      break;
                      
                  default:
-                       goto CaptureBacktrack7;
+                       goto CaptureBacktrack6;
              }
          //}
          
              base.Capture(16, capture_starting_pos15, pos);
              
              Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos15);
-               goto CaptureSkipBacktrack8;
+               goto CaptureSkipBacktrack7;
              
-               CaptureBacktrack8:
+               CaptureBacktrack7:
              capture_starting_pos15 = base.runstack![--stackpos];
              goto CharLoopBacktrack2;
              
-               CaptureSkipBacktrack8:;
+               CaptureSkipBacktrack7:;
          //}
          
          base.Capture(12, capture_starting_pos14, pos);
          
          Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos14);
-           goto CaptureSkipBacktrack9;
+           goto CaptureSkipBacktrack8;
          
-           CaptureBacktrack9:
+           CaptureBacktrack8:
          capture_starting_pos14 = base.runstack![--stackpos];
-           goto CaptureBacktrack8;
+           goto CaptureBacktrack7;
          
-           CaptureSkipBacktrack9:;
+           CaptureSkipBacktrack8:;
      //}
      
      
      if (--loop_iteration1 < 0)
      {
          // Unable to match the remainder of the expression after exhausting the loop.
-           goto CaptureBacktrack7;
+           goto CaptureBacktrack6;
      }
      pos = base.runstack![--stackpos];
      UncaptureUntil(base.runstack![--stackpos]);
      if (loop_iteration1 == 0)
      {
          // No iterations of the loop remain to backtrack into. Fail the loop.
-           goto CaptureBacktrack7;
+           goto CaptureBacktrack6;
      }
-       goto CaptureBacktrack9;
+       goto CaptureBacktrack8;
      LoopEnd1:;
  //}
  
      
      base.Capture(17, capture_starting_pos16, pos);
      
-       goto CaptureSkipBacktrack10;
+       goto CaptureSkipBacktrack9;
      
-       CaptureBacktrack10:
+       CaptureBacktrack9:
      goto CharLoopBacktrack3;
      
-       CaptureSkipBacktrack10:;
+       CaptureSkipBacktrack9:;
  //}
  
  // Match if at the end of the string or if before an ending newline.
  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
  {
-       goto CaptureBacktrack10;
+       goto CaptureBacktrack9;
  }
  
  // The input matched.
"\\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z ..." (5703 uses)
[GeneratedRegex("\\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)\\Z", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.CultureInvariant)]
  int loop_iteration2 = 0;
  int loop_iteration3 = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
+   int startingStackpos1 = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match if at the beginning of the string.
          goto LoopIterationNoMatch1;
      }
      
-       // Optional (greedy).
-       //{
-           pos++;
-           slice = inputSpan.Slice(pos);
-           loop_iteration2 = 0;
+       // Atomic group.
+       {
+           int atomic_stackpos = stackpos;
          
-           LoopBody2:
-           Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
-           
-           loop_iteration2++;
-           
-           // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+           // Optional (greedy).
          //{
-               charloop_starting_pos1 = pos;
+               pos++;
+               slice = inputSpan.Slice(pos);
+               startingStackpos = stackpos;
+               loop_iteration2 = 0;
              
-               int iteration2 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
-               if (iteration2 < 0)
+               LoopBody2:
+               Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
+               
+               loop_iteration2++;
+               
+               // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+               //{
+                   charloop_starting_pos1 = pos;
+                   
+                   int iteration2 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
+                   if (iteration2 < 0)
+                   {
+                       iteration2 = slice.Length;
+                   }
+                   
+                   slice = slice.Slice(iteration2);
+                   pos += iteration2;
+                   
+                   charloop_ending_pos1 = pos;
+                   goto CharLoopEnd1;
+                   
+                   CharLoopBacktrack1:
+                   Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
+                   
+                   if (Utilities.s_hasTimeout)
+                   {
+                       base.CheckTimeout();
+                   }
+                   
+                   if (charloop_starting_pos1 >= charloop_ending_pos1 ||
+                       (charloop_ending_pos1 = inputSpan.Slice(charloop_starting_pos1, charloop_ending_pos1 - charloop_starting_pos1).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
+                   {
+                       goto LoopIterationNoMatch2;
+                   }
+                   charloop_ending_pos1 += charloop_starting_pos1;
+                   pos = charloop_ending_pos1;
+                   slice = inputSpan.Slice(pos);
+                   
+                   CharLoopEnd1:
+                   Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1);
+               //}
+               
+               // Match a character in the set [0-9A-Za-z\u212A].
+               if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
              {
-                   iteration2 = slice.Length;
+                   goto CharLoopBacktrack1;
              }
              
-               slice = slice.Slice(iteration2);
-               pos += iteration2;
-               
-               charloop_ending_pos1 = pos;
-               goto CharLoopEnd1;
-               
-               CharLoopBacktrack1:
-               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
-               
-               if (Utilities.s_hasTimeout)
-               {
-                   base.CheckTimeout();
-               }
-               
-               if (charloop_starting_pos1 >= charloop_ending_pos1 ||
-                   (charloop_ending_pos1 = inputSpan.Slice(charloop_starting_pos1, charloop_ending_pos1 - charloop_starting_pos1).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
-               {
-                   goto LoopIterationNoMatch2;
-               }
-               charloop_ending_pos1 += charloop_starting_pos1;
-               pos = charloop_ending_pos1;
+               pos++;
              slice = inputSpan.Slice(pos);
              
-               CharLoopEnd1:
-               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1);
+               // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
+               if (loop_iteration2 == 0)
+               {
+                   goto LoopBody2;
+               }
+               goto LoopEnd2;
+               
+               // The loop iteration failed. Put state back to the way it was before the iteration.
+               LoopIterationNoMatch2:
+               if (--loop_iteration2 < 0)
+               {
+                   // Unable to match the remainder of the expression after exhausting the loop.
+                   goto LoopIterationNoMatch1;
+               }
+               pos = base.runstack![--stackpos];
+               slice = inputSpan.Slice(pos);
+               LoopEnd2:
+               stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
          //}
          
-           // Match a character in the set [0-9A-Za-z\u212A].
-           if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
-           {
-               goto CharLoopBacktrack1;
-           }
-           
-           pos++;
-           slice = inputSpan.Slice(pos);
-           
-           // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
-           if (loop_iteration2 == 0)
-           {
-               goto LoopBody2;
-           }
-           goto LoopEnd2;
-           
-           // The loop iteration failed. Put state back to the way it was before the iteration.
-           LoopIterationNoMatch2:
-           if (--loop_iteration2 < 0)
-           {
-               // Unable to match the remainder of the expression after exhausting the loop.
-               goto LoopIterationNoMatch1;
-           }
-           pos = base.runstack![--stackpos];
-           slice = inputSpan.Slice(pos);
-           goto LoopEnd2;
-           
-           LoopBacktrack:
-           if (Utilities.s_hasTimeout)
-           {
-               base.CheckTimeout();
-           }
-           
-           if (loop_iteration2 == 0)
-           {
-               // No iterations of the loop remain to backtrack into. Fail the loop.
-               goto LoopIterationNoMatch1;
-           }
-           goto CharLoopBacktrack1;
-           LoopEnd2:
-           
-           Utilities.StackPush(ref base.runstack!, ref stackpos, loop_iteration2);
-           goto LoopSkipBacktrack;
-           
-           LoopBacktrack1:
-           loop_iteration2 = base.runstack![--stackpos];
-           if (Utilities.s_hasTimeout)
-           {
-               base.CheckTimeout();
-           }
-           
-           goto LoopBacktrack;
-           
-           LoopSkipBacktrack:;
-       //}
+           stackpos = atomic_stackpos;
+       }
      
      // Match '.'.
      if (slice.IsEmpty || slice[0] != '.')
      {
-           goto LoopBacktrack1;
+           goto LoopIterationNoMatch1;
      }
      
      pos++;
      slice = inputSpan.Slice(pos);
      if (loop_iteration1 == 0)
      {
-           // No iterations have been matched to backtrack into. Fail the loop.
+           // All possible iterations have matched, but it's below the required minimum of 1. Fail the loop.
          goto LoopIterationNoMatch;
      }
      
-       goto LoopEnd1;
-       
-       LoopBacktrack2:
-       if (Utilities.s_hasTimeout)
-       {
-           base.CheckTimeout();
-       }
-       
-       if (loop_iteration1 == 0)
-       {
-           // No iterations of the loop remain to backtrack into. Fail the loop.
-           goto LoopIterationNoMatch;
-       }
-       goto LoopBacktrack1;
      LoopEnd1:;
  //}
  
  // Match a character in the set [0-9A-Za-z\u212A].
  if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
  {
-       goto LoopBacktrack2;
+       goto LoopIterationNoMatch1;
  }
  
-   // Optional (greedy).
-   //{
-       pos++;
-       slice = inputSpan.Slice(pos);
-       loop_iteration3 = 0;
+   // Atomic group.
+   {
+       int atomic_stackpos1 = stackpos;
      
-       LoopBody3:
-       Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
-       
-       loop_iteration3++;
-       
-       // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+       // Optional (greedy).
      //{
-           charloop_starting_pos2 = pos;
+           pos++;
+           slice = inputSpan.Slice(pos);
+           startingStackpos1 = stackpos;
+           loop_iteration3 = 0;
          
-           int iteration3 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
-           if (iteration3 < 0)
+           LoopBody3:
+           Utilities.StackPush(ref base.runstack!, ref stackpos, pos);
+           
+           loop_iteration3++;
+           
+           // Match a character in the set [\-0-9A-Za-z\u212A] greedily any number of times.
+           //{
+               charloop_starting_pos2 = pos;
+               
+               int iteration3 = slice.IndexOfAnyExcept(Utilities.s_asciiLettersAndDigitsAndDashKelvinSign);
+               if (iteration3 < 0)
+               {
+                   iteration3 = slice.Length;
+               }
+               
+               slice = slice.Slice(iteration3);
+               pos += iteration3;
+               
+               charloop_ending_pos2 = pos;
+               goto CharLoopEnd2;
+               
+               CharLoopBacktrack2:
+               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
+               
+               if (Utilities.s_hasTimeout)
+               {
+                   base.CheckTimeout();
+               }
+               
+               if (charloop_starting_pos2 >= charloop_ending_pos2 ||
+                   (charloop_ending_pos2 = inputSpan.Slice(charloop_starting_pos2, charloop_ending_pos2 - charloop_starting_pos2).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
+               {
+                   goto LoopIterationNoMatch3;
+               }
+               charloop_ending_pos2 += charloop_starting_pos2;
+               pos = charloop_ending_pos2;
+               slice = inputSpan.Slice(pos);
+               
+               CharLoopEnd2:
+               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2);
+           //}
+           
+           // Match a character in the set [0-9A-Za-z\u212A].
+           if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
          {
-               iteration3 = slice.Length;
+               goto CharLoopBacktrack2;
          }
          
-           slice = slice.Slice(iteration3);
-           pos += iteration3;
-           
-           charloop_ending_pos2 = pos;
-           goto CharLoopEnd2;
-           
-           CharLoopBacktrack2:
-           Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
-           
-           if (Utilities.s_hasTimeout)
-           {
-               base.CheckTimeout();
-           }
-           
-           if (charloop_starting_pos2 >= charloop_ending_pos2 ||
-               (charloop_ending_pos2 = inputSpan.Slice(charloop_starting_pos2, charloop_ending_pos2 - charloop_starting_pos2).LastIndexOfAny(Utilities.s_asciiLettersAndDigitsAndKelvinSign)) < 0)
-           {
-               goto LoopIterationNoMatch3;
-           }
-           charloop_ending_pos2 += charloop_starting_pos2;
-           pos = charloop_ending_pos2;
+           pos++;
          slice = inputSpan.Slice(pos);
          
-           CharLoopEnd2:
-           Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2);
+           // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
+           if (loop_iteration3 == 0)
+           {
+               goto LoopBody3;
+           }
+           goto LoopEnd3;
+           
+           // The loop iteration failed. Put state back to the way it was before the iteration.
+           LoopIterationNoMatch3:
+           if (--loop_iteration3 < 0)
+           {
+               // Unable to match the remainder of the expression after exhausting the loop.
+               goto LoopIterationNoMatch1;
+           }
+           pos = base.runstack![--stackpos];
+           slice = inputSpan.Slice(pos);
+           LoopEnd3:
+           stackpos = startingStackpos1; // Ensure any remaining backtracking state is removed.
      //}
      
-       // Match a character in the set [0-9A-Za-z\u212A].
-       if (slice.IsEmpty || ((ch = slice[0]) < 128 ? !char.IsAsciiLetterOrDigit(ch) : !RegexRunner.CharInClass((char)ch, "\0\b\00:A[a{KÅ")))
-       {
-           goto CharLoopBacktrack2;
-       }
-       
-       pos++;
-       slice = inputSpan.Slice(pos);
-       
-       // The loop has an upper bound of 1. Continue iterating greedily if it hasn't yet been reached.
-       if (loop_iteration3 == 0)
-       {
-           goto LoopBody3;
-       }
-       goto LoopEnd3;
-       
-       // The loop iteration failed. Put state back to the way it was before the iteration.
-       LoopIterationNoMatch3:
-       if (--loop_iteration3 < 0)
-       {
-           // Unable to match the remainder of the expression after exhausting the loop.
-           goto LoopBacktrack2;
-       }
-       pos = base.runstack![--stackpos];
-       slice = inputSpan.Slice(pos);
-       goto LoopEnd3;
-       
-       LoopBacktrack3:
-       if (Utilities.s_hasTimeout)
-       {
-           base.CheckTimeout();
-       }
-       
-       if (loop_iteration3 == 0)
-       {
-           // No iterations of the loop remain to backtrack into. Fail the loop.
-           goto LoopBacktrack2;
-       }
-       goto CharLoopBacktrack2;
-       LoopEnd3:;
-   //}
+       stackpos = atomic_stackpos1;
+   }
  
  // Match if at the end of the string or if before an ending newline.
  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
  {
-       goto LoopBacktrack3;
+       goto LoopIterationNoMatch1;
  }
  
  // The input matched.
"^!([0-9A-Za-z_\\-]*!)?$" (3395 uses)
[GeneratedRegex("^!([0-9A-Za-z_\\-]*!)?$")]
  int matchStart = pos;
  int loop_iteration = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match if at the beginning of the string.
  }
  
  // Optional (greedy).
-   //{
+   {
      pos++;
      slice = inputSpan.Slice(pos);
+       startingStackpos = stackpos;
      loop_iteration = 0;
      
      LoopBody:
      pos = base.runstack![--stackpos];
      UncaptureUntil(base.runstack![--stackpos]);
      slice = inputSpan.Slice(pos);
-       LoopEnd:;
-   //}
+       LoopEnd:
+       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+   }
  
  // Match if at the end of the string or if before an ending newline.
  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
  {
-       goto LoopIterationNoMatch;
+       UncaptureUntil(0);
+       return false; // The input didn't match.
  }
  
  // The input matched.
"AssemblyFileVersion(Attribute)?\\s*\\(.*\\)\\s*" (2417 uses)
[GeneratedRegex("AssemblyFileVersion(Attribute)?\\s*\\(.*\\)\\s*")]
  int charloop_starting_pos = 0, charloop_ending_pos = 0;
  int loop_iteration = 0;
  int stackpos = 0;
+   int startingStackpos = 0;
  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
  
  // Match the string "AssemblyFileVersion".
  }
  
  // Optional (greedy).
-   //{
+   {
      pos += 19;
      slice = inputSpan.Slice(pos);
+       startingStackpos = stackpos;
      loop_iteration = 0;
      
      LoopBody:
      pos = base.runstack![--stackpos];
      UncaptureUntil(base.runstack![--stackpos]);
      slice = inputSpan.Slice(pos);
-       LoopEnd:;
-   //}
+       LoopEnd:
+       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+   }
  
  // Match a whitespace character atomically any number of times.
  {
  // Match '('.
  if (slice.IsEmpty || slice[0] != '(')
  {
-       goto LoopIterationNoMatch;
+       UncaptureUntil(0);
+       return false; // The input didn't match.
  }
  
  // Match a character other than '\n' greedily any number of times.
      if (charloop_starting_pos >= charloop_ending_pos ||
          (charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOf(')')) < 0)
      {
-           goto LoopIterationNoMatch;
+           UncaptureUntil(0);
+           return false; // The input didn't match.
      }
      charloop_ending_pos += charloop_starting_pos;
      pos = charloop_ending_pos;
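
The diffs above replace jumps to a backtracking label with an immediate failure, because the optional group is now treated as atomic. Below is a minimal sketch of what atomicity means at the pattern level, using the explicit `(?>...)` atomic-group syntax of System.Text.RegularExpressions. The sample inputs are illustrative assumptions, and the hand-written atomic pattern only approximates what the optimizer infers; it is not the PR's implementation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Case 1: the loop overlaps with what follows it, so backtracking matters.
bool greedyOverlap = Regex.IsMatch("aaa", @"^a*a$");     // greedy a* gives one 'a' back
bool atomicOverlap = Regex.IsMatch("aaa", @"^(?>a*)a$"); // atomic a* keeps all three
Console.WriteLine($"{greedyOverlap} {atomicOverlap}");   // True False

// Case 2: pattern from the diff above, with the optional group wrapped by hand.
// The inputs are assumed samples, not taken from the PR.
const string Greedy = @"^!([0-9A-Za-z_\-]*!)?$";
const string Atomic = @"^!(?>([0-9A-Za-z_\-]*!)?)$";
string[] inputs = { "!", "!ab_1!", "!ab_1", "!ab!x", "!!\n" };
bool allAgree = inputs.All(s => Regex.IsMatch(s, Greedy) == Regex.IsMatch(s, Atomic));
Console.WriteLine(allAgree); // True
```

In the first case the atomic form changes the result, so such a loop could not be converted. In the second, both forms agree on every sample input, which is consistent with the generated code dropping its backtracking state for that loop.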

For more diff examples, see https://gist.github.com/MihuBot/be318cdd30033e74e6db6121388320b2

Total bytes of base: 54706120
Total bytes of diff: 54252502
Total bytes of delta: -453618 (-0.83 % of base)
Total relative delta: -74.45
    diff is an improvement.
    relative diff is an improvement.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
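using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1269.json";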
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2W4BR-A");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}
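
As one way to use the deserialized entries, here is a minimal sketch that lists the changed patterns ordered by how often they are used. It runs on a few hypothetical in-memory tuples rather than the real `Results.json`, and it assumes that an entry whose generated source did not change has a null `ShortDiff`. That assumption is not confirmed by the sample above.

```csharp
#nullable enable
using System;
using System.Linq;

// Hypothetical stand-ins for deserialized entries; the "abc" row is a made-up placeholder.
// Assumption: ShortDiff is null when the generated source did not change.
(string Pattern, int Count, string? ShortDiff)[] sample =
{
    ("AssemblyFileVersion(Attribute)?\\s*\\(.*\\)\\s*", 2417, "-   //{\n+   {"),
    ("^!([0-9A-Za-z_\\-]*!)?$", 3395, "-   //{\n+   {"),
    ("abc", 1, null),
};

// Keep only entries with a diff, most frequently used pattern first.
var changed = sample
    .Where(e => e.ShortDiff is not null)
    .OrderByDescending(e => e.Count)
    .ToArray();

Console.WriteLine($"{changed.Length} of {sample.Length} sample patterns changed");
foreach (var e in changed)
{
    Console.WriteLine($"{e.Count} uses: {e.Pattern}");
}
```

With the real data, the same query would run over `entries`, reading `Regex.Pattern` and `Regex.Count` from each `RegexEntry`.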

@stephentoub stephentoub merged commit c0f4c7d into dotnet:main Jul 24, 2025
83 of 87 checks passed
@stephentoub stephentoub deleted the atomicloops branch July 24, 2025 18:15
@github-actions github-actions bot locked and limited conversation to collaborators Aug 24, 2025