[perl #74022] Parser hangs on some Unicode characters

Father Chrysostomos · Father Chrysostomos · commit d7425188c3be · 2010-11-14T06:47:29.000-08:00
This changes the definition of isIDFIRST_utf8 to avoid any characters
that would put the parser in a loop.

isIDFIRST_utf8 is used all over the place in toke.c. Almost every
instance is followed by a call to S_scan_word. S_scan_word is only
called when it is known that there is a word to scan.

What was happening was that isIDFIRST_utf8 would accept a character,
but S_scan_word in toke.c would then reject it, as it was using
is_utf8_alnum, resulting in an infinite number of zero-length
identifiers.
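
A minimal sketch of that mismatch, not the actual toke.c code. Every
toy_* name, old_idfirst and new_idfirst are hypothetical stand-ins, and
the toy predicates only know ASCII plus U+0387 (the character used in
the new test), which is ID_Continue but not alphanumeric. The sketch
reports a stall where the real parser would spin forever.

```c
#include <stddef.h>

/* Toy stand-ins for is_utf8_digit / is_utf8_alnum / is_utf8_idcont.
 * Assumption: they cover only ASCII plus U+0387; the real tables are
 * far larger. */
static int toy_is_digit(unsigned cp) { return cp >= '0' && cp <= '9'; }

static int toy_is_alnum(unsigned cp)
{
    return toy_is_digit(cp)
        || (cp >= 'a' && cp <= 'z')
        || (cp >= 'A' && cp <= 'Z')
        || cp == '_';
}

static int toy_is_idcont(unsigned cp)
{
    return toy_is_alnum(cp) || cp == 0x387;
}

/* Old and new shapes of the isIDFIRST_utf8 test, on code points. */
static int old_idfirst(unsigned cp)
{
    return toy_is_idcont(cp) && !toy_is_digit(cp);
}

static int new_idfirst(unsigned cp)
{
    return toy_is_idcont(cp) && !toy_is_digit(cp) && toy_is_alnum(cp);
}

/* Hypothetical word scanner: consumes alphanumerics only and returns
 * how many code points it consumed. */
static size_t toy_scan_word(const unsigned *s, size_t len)
{
    size_t n = 0;
    while (n < len && toy_is_alnum(s[n]))
        n++;
    return n;
}

/* Returns 1 if the whole input is consumed, 0 if the scanner is asked
 * for a word and consumes nothing (the real parser would then retry at
 * the same position indefinitely). */
static int toy_tokenize(const unsigned *s, size_t len,
                        int (*idfirst)(unsigned))
{
    size_t pos = 0;
    while (pos < len) {
        if (idfirst(s[pos])) {
            size_t n = toy_scan_word(s + pos, len - pos);
            if (n == 0)
                return 0;   /* zero-length identifier: no progress */
            pos += n;
        } else {
            pos++;          /* treated as some other one-char token */
        }
    }
    return 1;
}
```

For a one-element input holding 0x387, toy_tokenize returns 0 with
old_idfirst and 1 with new_idfirst, because the stricter test never
hands the character to the word scanner in the first place.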

Another possible solution was to change S_scan_word to use
isIDFIRST_utf8 or similar, but that has back-compatibility problems,
as it stops q·foo· from being a string and makes it an identifier
instead.
diff --git a/handy.h b/handy.h
@@ -849,10 +849,16 @@ patched there.  The file as of this writing is cpan/Devel-PPPort/parts/inc/misc
 #define isBLANK_LC_uni(c)	isBLANK(c) /* could be wrong */
 
 #define isALNUM_utf8(p)		is_utf8_alnum(p)
-/* The ID_Start of Unicode is quite limiting: it assumes a L-class
- * character (meaning that you cannot have, say, a CJK character).
- * Instead, let's allow ID_Continue but not digits. */
-#define isIDFIRST_utf8(p)	(is_utf8_idcont(p) && !is_utf8_digit(p))
+/* The ID_Start of Unicode was originally quite limiting: it assumed an
+ * L-class character (meaning that you could not have, say, a CJK charac-
+ * ter). So, instead, perl has for a long time allowed ID_Continue but
+ * not digits.
+ * We still preserve that for backward compatibility. But we also make sure
+ * that it is alphanumeric, so S_scan_word in toke.c will not hang. See
+ *    http://rt.perl.org/rt3/Ticket/Display.html?id=74022
+ * for more detail than you ever wanted to know about. */
+#define isIDFIRST_utf8(p) \
+    (is_utf8_idcont(p) && !is_utf8_digit(p) && is_utf8_alnum(p))
 #define isALPHA_utf8(p)		is_utf8_alpha(p)
 #define isSPACE_utf8(p)		is_utf8_space(p)
 #define isDIGIT_utf8(p)		is_utf8_digit(p)
diff --git a/t/comp/parser.t b/t/comp/parser.t
@@ -3,7 +3,7 @@
 # Checks if the parser behaves correctly in edge cases
 # (including weird syntax errors)
 
-print "1..122\n";
+print "1..123\n";
 
 sub failed {
     my ($got, $expected, $name) = @_;
@@ -355,6 +355,11 @@ is($@, "", "multiline whitespace inside substitute expression");
 
 # Add new tests HERE:
 
+# bug #74022: Loop on characters in \p{OtherIDContinue}
+# This test hangs if it fails.
+eval chr 0x387;
+is(1,1, '[perl #74022] Parser looping on OtherIDContinue chars');
+
 # More awkward tests for #line. Keep these at the end, as they will screw
 # with sane line reporting for any other test failures