Commit ace0eae

Exclude RtoL characters from paired regex delimiters
Fixes Perl#22228. Some scripts in the world, such as Arabic and Hebrew, are written right-to-left. This can cause confusion with the paired regex pattern delimiters, which were chosen assuming left-to-right order. Therefore exclude all such characters. Currently, the only two that fall into this category and are not already excluded for other reasons are SYRIAC COLON SKEWED LEFT/RIGHT.
1 parent 1a2e8e7 commit ace0eae
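
The exclusion rule described in the commit message can be sketched in isolation as follows. This is a minimal illustration, not code from the commit: `is_r2l` is a hypothetical helper name, and it uses an ordinary bracketed character class rather than the `(?[ ... ])` set syntax that the patch uses, on the assumption that the two match the same characters (the union of `Bidi_Class` R and AL).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of regen/unicode_constants.pl): returns 1 if
# the code point's Bidi_Class is R (Right_To_Left) or AL (Arabic_Letter),
# which is the condition the commit uses to discard a delimiter candidate.
sub is_r2l {
    my ($code_point) = @_;
    return chr($code_point) =~ /[\p{Bidi_Class:R}\p{Bidi_Class:AL}]/ ? 1 : 0;
}

# U+0706 and U+0707 are the SYRIAC COLON SKEWED LEFT/RIGHT pair removed by
# this commit; U+007B and U+00AB remain in the accepted list.
for my $cp (0x0706, 0x0707, 0x007B, 0x00AB) {
    printf "U+%04X %s\n", $cp, is_r2l($cp) ? "excluded (R to L)" : "kept";
}
```

A plain bracketed class was chosen here so the sketch does not depend on the regex-sets feature, which was still experimental in older Perl releases.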

3 files changed: +19 -11 lines changed

pod/perlop.pod

Lines changed: 0 additions & 2 deletions
@@ -3850,7 +3850,6 @@ The complete list of accepted paired delimiters as of Unicode 14.0 is:
  { } U+007B, U+007D LEFT/RIGHT CURLY BRACKET
  « » U+00AB, U+00BB LEFT/RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
  » « U+00BB, U+00AB RIGHT/LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
- ܆ ܇ U+0706, U+0707 SYRIAC COLON SKEWED LEFT/RIGHT
  ༺ ༻ U+0F3A, U+0F3B TIBETAN MARK GUG RTAGS GYON, TIBETAN MARK GUG
      RTAGS GYAS
  ༼ ༽ U+0F3C, U+0F3D TIBETAN MARK ANG KHANG GYON, TIBETAN MARK ANG
@@ -4231,5 +4230,4 @@ The complete list of accepted paired delimiters as of Unicode 14.0 is:
  🢩 🢨 U+1F8A9, U+1F8A8 RIGHT/LEFTWARDS BACK-TILTED SHADOWED WHITE ARROW
  🢫 🢪 U+1F8AB, U+1F8AA RIGHT/LEFTWARDS FRONT-TILTED SHADOWED WHITE
      ARROW
-
 =cut

regen/unicode_constants.pl

Lines changed: 10 additions & 0 deletions
@@ -378,6 +378,7 @@ END
 my $illegal = "Mirror illegal";
 my $no_encoded_mate = "Mirrored, but Unicode has no encoded mirror";
 my $bidirectional = "Bidirectional";
+my $r2l = "Is in a Right to Left script";

 my %unused_bidi_pairs;
 my %inverted_unused_bidi_pairs;
@@ -634,6 +635,15 @@ END
         next;
     }

+    # Exclude characters that are R to L ordering, as this can cause
+    # confusion. See GH #22228
+    if ($chr =~ / (?[ \p{Bidi_Class:R} + \p{Bidi_Class:AL} ]) /x) {
+        $discards{$code_point} = { reason => $r2l,
+                                   mirror => undef
+                                 };
+        next;
+    }
+
     # We enter the pair with the original code point on the left; if it
     # should instead be on the R, swap. Most Symbols that contain the
     # word REVERSE go on the rhs, except those whose names explicitly
