"the Unicode bug", reversed? #11635

p5pRT · 2011-09-06T15:28:58Z

Migrated from rt.perl.org#98546 (status was 'resolved')

Searchable as RT98546$

p5pRT · 2011-09-06T15:28:59Z

From [email protected]

Summary: If you use -E, matches fail that work fine under -e. This is
in some sense the opposite of the Unicode bug, which normally
works the other way around.

Matthew Barnett, who is implementing full casefolding in Python,
initially reported to me these Perl bugs:

However, these match:

"\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i
"\N{LATIN SMALL LIGATURE LONG S T}" =~ /st/i
"\N{LATIN SMALL LIGATURE ST}" =~ /st/i
"\N{LATIN SMALL LETTER SHARP S}t" =~ /sst/i

but these don't match:

"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i
"s\N{LATIN SMALL LIGATURE ST}" =~ /sst/i

I think what might be happening is that it isn't handling the
possibility of overlapping full case-folding.

When it sees "sst" in the regex it identifies "ss" as a possible result
of full case-folding and so adds the unfolded alternative:

ss => ss|\N{LATIN SMALL LETTER SHARP S}

but it then doesn't identify "st" as another possible result of full
case-folding, so it doesn't add the unfolded alternative (either of
them, in fact):

st => st|\N{LATIN SMALL LIGATURE ST}

It should be doing:

sst => sst|\N{LATIN SMALL LETTER SHARP S}t|s\N{LATIN SMALL LIGATURE ST}

(Again, I'm ignoring the other alternative.)
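
The rewriting Matthew describes above can be modelled outside Perl. What follows is a minimal Python sketch, not the logic in regcomp.c: `REVERSE_FOLDS` is a hand-written and deliberately incomplete table, and `unfolded_alternatives` is a hypothetical helper name. It lists every string obtained by optionally replacing a run of the folded pattern with a single code point that folds to that run, including the "other alternative" set aside above.

```python
# Hand-picked subset of code points whose full case fold is multi-character.
# Assumption: illustrative only; a real table would come from CaseFolding.txt.
REVERSE_FOLDS = {
    "ss": ["\u00DF"],            # LATIN SMALL LETTER SHARP S
    "st": ["\uFB05", "\uFB06"],  # LONG S T ligature, ST ligature
}


def unfolded_alternatives(folded):
    """Return every string built by optionally replacing substrings of
    `folded` with one code point whose full case fold is that substring."""
    if not folded:
        return [""]
    # Option 1: keep the first character literally.
    results = [folded[0] + rest for rest in unfolded_alternatives(folded[1:])]
    # Option 2: consume a multi-character fold that starts here.
    for fold, sources in REVERSE_FOLDS.items():
        if folded.startswith(fold):
            for rest in unfolded_alternatives(folded[len(fold):]):
                results.extend(src + rest for src in sources)
    return results


alternatives = unfolded_alternatives("sst")
for alt in alternatives:
    # Print code points rather than glyphs so the output is unambiguous.
    print(" ".join(f"U+{ord(c):04X}" for c in alt))
```

Because both "ss" and "st" can start inside "sst", the recursion has to try a fold at every position, which is exactly the overlap the report says is being missed.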

And it is indeed true that those two test cases fail, under both 5.14 and blead:

This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level

This is perl 5, version 15, subversion 2 (v5.15.2-264-g87e4a53) built for darwin-2level

As shown here:

% perl -Mcharnames=:full -lE 'print "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Fail
% perl -Mcharnames=:full -lE 'print "s\N{LATIN SMALL LIGATURE ST}" =~ /sst/i ? "Pass" : "Fail"'
Fail

However, merely change the -E to a -e, and suddenly they work!

% perl -Mcharnames=:full -le 'print "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Pass
% perl -Mcharnames=:full -le 'print "s\N{LATIN SMALL LIGATURE ST}" =~ /sst/i ? "Pass" : "Fail"'
Pass

So it looks like this is some reverse Unicode bug. Very strange.

For the record, Ruby does get these right:

% ruby -le 'print "s\uFB05" =~ /sst/i ? "Pass" : "Fail"'
Pass
% ruby -le 'print "s\uFB06" =~ /sst/i ? "Pass" : "Fail"'
Pass

Where that is:

% ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]

Here are other, probably related issues:

% perl -lE 'print "\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
Pass
% perl -lE 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
Fail
% blead -lE 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
Fail

However, unlike the earlier attempts, *those* do *not* suddenly pass if
you use -e instead of -E:

% perl -le 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
Fail
% blead -le 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
Fail

See: it still fails. Very strange. They work fine in Ruby:

% ruby -le 'print "\uFB05" =~ /st/i ? "Pass" : "Fail"'
Pass
% ruby -le 'print "\u00DF\uFB05" =~ /st/i ? "Pass" : "Fail"'
Pass

Like Perl, Ruby does *not* do partial matches of full casefolds
(I don't think the idea makes sense), so it's not like it's going
totally overboard with full casefolding:

% perl -lE 'print "\x{DF}\x{FB05}" =~ /ssst/i ? "Pass" : "Fail"'
Pass
% ruby -le 'print "\u00DF\uFB05" =~ /ssst/i ? "Pass" : "Fail"'
Pass

% perl -lE 'print "\x{DF}\x{FB05}" =~ /sst/i ? "Pass" : "Fail"'
Fail
% ruby -le 'print "\u00DF\uFB05" =~ /sst/i ? "Pass" : "Fail"'
Fail

Which is as expected. The others aren't.
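
The expectations above (whole code points only, no partial matches of a full casefold) can be written down as a small reference model. This is a Python sketch under that assumption, using `str.casefold` for the full case fold; `fold_search` is a hypothetical helper and says nothing about how Perl or Ruby implement matching internally.

```python
def fold_search(haystack, needle):
    """Return the start index of the first run of whole code points in
    `haystack` whose full case fold equals the fold of `needle`, else None."""
    target = needle.casefold()
    for start in range(len(haystack)):
        for end in range(start + 1, len(haystack) + 1):
            if haystack[start:end].casefold() == target:
                return start
    return None


cases = [
    ("s\uFB05", "sst"),        # s + LONG S T ligature
    ("s\uFB06", "sst"),        # s + ST ligature
    ("\u00DF\uFB05", "st"),    # sharp s + LONG S T ligature
    ("\u00DF\uFB05", "ssst"),  # both folds consumed in full
    ("\u00DF\uFB05", "sst"),   # would need only half of the sharp s
]
for text, pattern in cases:
    verdict = "Pass" if fold_search(text, pattern) is not None else "Fail"
    print(pattern, verdict)
```

Only the last case is rejected by this model, since no run of whole code points in the subject folds to exactly "sst".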

--tom
--

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

Platform:
osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
uname='openbsd chthon 4.4 generic#0 i386 '
config_args='-des'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=y, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
optimize='-O2',
cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-lgdbm -lm -lutil -lc
perllibs=-lm -lutil -lc
libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
USE_PERL_ATOF
Built under openbsd
Compiled at Jun 11 2011 11:48:28
%ENV:
PERL_UNICODE="SA"
@INC:
/usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
/usr/local/lib/perl5/site_perl/5.14.0
/usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
/usr/local/lib/perl5/5.14.0
/usr/local/lib/perl5/site_perl/5.12.3
/usr/local/lib/perl5/site_perl/5.11.3
/usr/local/lib/perl5/site_perl/5.10.1
/usr/local/lib/perl5/site_perl/5.10.0
/usr/local/lib/perl5/site_perl/5.8.7
/usr/local/lib/perl5/site_perl/5.8.0
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl/5.005
/usr/local/lib/perl5/site_perl
.

p5pRT · 2011-09-06T21:40:00Z

From @rgarcia

On 6 September 2011 17:29, tchrist1 wrote:

Summary: If you use -E, matches fail that work fine under -e. This is
in some sense the opposite of the Unicode bug, which normally
works the other way around.

That would be because -E turns on all current features, including
"unicode_strings"

§ perl -Mcharnames=:full -Mfeature=unicode_strings -le 'print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Fail
§ perl -Mcharnames=:full -lE 'no feature "unicode_strings";print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Pass

Now, I would let Karl comment on whether this is a bug in
unicode_strings, or not...

p5pRT · 2011-09-06T21:40:00Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2011-09-07T15:31:14Z

From [email protected]

"Rafael Garcia-Suarez via RT" <perlbug-followup@perl.org> wrote
on Tue, 06 Sep 2011 14:40:00 PDT:

On 6 September 2011 17:29, tchrist1 wrote:

Summary: If you use -E, matches fail that work fine under -e. This is
in some sense the opposite of the Unicode bug, which normally
works the other way around.

That would be because -E turns on all current features, including
"unicode_strings"

Yes, I realize that. Normally, "The Unicode Bug" is cleared up by
enabling the unicode_strings feature, whereas here doing so triggers it:

§ perl -Mcharnames=:full -Mfeature=unicode_strings -le 'print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Fail

§ perl -Mcharnames=:full -lE 'no feature "unicode_strings";print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Pass

Now, I would let Karl comment on whether this is a bug in
unicode_strings, or not...

I can't really see how it wouldn't be.

--tom

p5pRT · 2011-09-08T01:09:06Z

From @khwilliamson

On 09/07/2011 09:30 AM, Tom Christiansen wrote:

"Rafael Garcia-Suarez via RT"<perlbug-followup@perl.org> wrote
on Tue, 06 Sep 2011 14:40:00 PDT:

On 6 September 2011 17:29, tchrist1 wrote:

Summary: If you use -E, matches fail that work fine under -e. This is
in some sense the opposite of the Unicode bug, which normally
works the other way around.

That would be because -E turns on all current features, including
"unicode_strings"

Yes, I realize that. Normally, "The Unicode Bug" is cleared up by
enabling the unicode_strings feature, whereas here doing so triggers it:
§ perl -Mcharnames=:full -Mfeature=unicode_strings -le 'print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Fail
§ perl -Mcharnames=:full -lE 'no feature "unicode_strings";print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Pass
Now, I would let Karl comment on whether this is a bug in
unicode_strings, or not...

I can't really see how it wouldn't be.

--tom

It's not a bug in unicode_strings, as that is irrelevant here, since the
string is in utf8. It is a bug in regcomp.c; and it is from my
overlooking the situation specified in the bug report. I am thinking
about solutions; probably a static analysis done under control of regen
to look for cases where the tail of a multichar fold can be the head of
another, and then have regcomp look for those and substitute in an
appropriate pattern. In the case of sst it would be something like
(?i:sst|\x{df}t|s\x{fb05}) plus whatever else the analysis for this
situation calls for, and the /i matching in the result is restricted to
single char folds, so e.g., \x{df} will match its capital, but not
expand out to 'ss' again.

This still won't handle the cases like (ss)(t), etc.

p5pRT · 2011-09-08T02:13:41Z

From @chipdude

On 9/7/2011 6:07 PM, Karl Williamson wrote:

I am thinking about solutions; probably a static analysis done under
control of regen to look for cases where the tail of a multichar fold
can be the head of another, and then have regcomp look for those and
substitute in an appropriate pattern. In the case of sst it would be
something like (?i:sst|\x{df}t|s\x{fb05}) plus whatever else the
analysis for this situation calls for, and the /i matching in the
result is restricted to single char folds, so e.g., \x{df} will match
its capital, but not expand out to 'ss' again.

This still won't handle the cases like (ss)(t), etc.

I've formulated a law that covers such things.

http://chip.typepad.com/weblog/2011/09/salzenbergs-law-of-pretense.html

Related: "Do not lie to thy compiler, for it shall get its revenge." -
Henry Spencer

p5pRT · 2011-09-08T04:22:29Z

From @khwilliamson

On 09/07/2011 07:07 PM, Karl Williamson wrote:

On 09/07/2011 09:30 AM, Tom Christiansen wrote:

"Rafael Garcia-Suarez via RT"<perlbug-followup@perl.org> wrote
on Tue, 06 Sep 2011 14:40:00 PDT:

On 6 September 2011 17:29, tchrist1 wrote:

Summary: If you use -E, matches fail that work fine under -e. This is
in some sense the opposite of the Unicode bug, which normally
works the other way around.

That would be because -E turns on all current features, including
"unicode_strings"

Yes, I realize that. Normally, "The Unicode Bug" is cleared up by
enabling the unicode_strings feature, whereas here doing so triggers it:

§ perl -Mcharnames=:full -Mfeature=unicode_strings -le 'print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Fail

§ perl -Mcharnames=:full -lE 'no feature "unicode_strings";print
"s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
Pass

Now, I would let Karl comment on whether this is a bug in
unicode_strings, or not...

I can't really see how it wouldn't be.

--tom

It's not a bug in unicode_strings, as that is irrelevant here, since the
string is in utf8. It is a bug in regcomp.c; and it is from my
overlooking the situation specified in the bug report. I am thinking
about solutions; probably a static analysis done under control of regen
to look for cases where the tail of a multichar fold can be the head of
another, and then have regcomp look for those and substitute in an
appropriate pattern. In the case of sst it would be something like
(?i:sst|\x{df}t|s\x{fb05}) plus whatever else the analysis for this
situation calls for, and the /i matching in the result is restricted to
single char folds, so e.g., \x{df} will match its capital, but not
expand out to 'ss' again.

This still won't handle the cases like (ss)(t), etc.

I should do more baking before I publicize my partially-baked ideas. I
thought some more about this while swimming, and I think the solution is
quite different than I present here, but will need some more baking to
be sure.

p5pRT · 2011-09-09T01:18:00Z

From [email protected]

"karl williamson via RT" <perlbug-followup@perl.org> wrote
on Wed, 07 Sep 2011 18:09:07 PDT:

It's not a bug in unicode_strings, as that is irrelevant here,
since the string is in utf8.

Hm, then how come -E gets a different answer than -e gets?
Perhaps I botched the tests.

It is a bug in regcomp.c; and it is from my
overlooking the situation specified in the bug report. I am thinking
about solutions; probably a static analysis done under control of regen
to look for cases where the tail of a multichar fold can be the head of
another, and then have regcomp look for those and substitute in an
appropriate pattern.

Yes, there are a few of those.

Besides the latin ff/fi/ffl/etc ligatures, there are many Greek
code points whose multichar folds *end* with a small iota and a
few whose multichar folds *begin* with a small iota. So those
are overlaps.

I'm sure this is the same bug, but I see that not all
combinations of FB00 (ff) and FB01 (fi) work.

% perl -le 'print "\x{FB00}\x{FB01}" =~ /ff/i || 0'
1
% perl -le 'print "\x{FB01}\x{FB00}" =~ /ff/i || 0'
0
% perl -le 'print "\x{FB00}\x{FB01}" =~ /fi/i || 0'
0
% perl -le 'print "\x{FB01}\x{FB00}" =~ /fi/i || 0'
1

Although full ones do:

% perl -le 'print "\x{FB00}\x{FB01}" =~ /fffi/i || 0'
1

Even this tricky bit:

% perl -le 'print "\x{FB00}\x{FB01}" =~ /f\x{FB03}/i || 0'
1

That last one, where it needs the casefolds of both sides, is kinda
impressive: Ruby doesn't get that one right, although it does
get right others that Perl fails at:

% ruby -le 'print "\uFB00\uFB01" =~ /ff/i ? "Pass": "Fail"'
Pass
% ruby -le 'print "\uFB00\uFB01" =~ /fi/i ? "Pass": "Fail"'
Pass
% ruby -le 'print "\uFB00\uFB01" =~ /fffi/i ? "Pass": "Fail"'
Pass
% ruby -le 'print "\uFB00\uFB01" =~ /f\uFB03/i ? "Pass": "Fail"'
Fail

And similarly, we get this one:

% perl -le 'print "\x{FB00}i" =~ /f\x{FB01}/i ? "Pass" : "Fail"'
Pass

But Ruby doesn't:

% ruby -le 'print "\uFB00i" =~ /f\uFB01/i ? "Pass" : "Fail"'
Fail

That's kinda cool that we manage that. Good going, Karl!

I looked at all 104 multichar folds, and the only ones that look like they
overlap in Latin are these:

ﬀ U+FB00 fc=ff LATIN SMALL LIGATURE FF
ﬃ U+FB03 fc=ffi LATIN SMALL LIGATURE FFI
ﬁ U+FB01 fc=fi LATIN SMALL LIGATURE FI
İ U+0130 fc=i○̇ LATIN CAPITAL LETTER I WITH DOT ABOVE

ﬀ U+FB00 fc=ff LATIN SMALL LIGATURE FF
ﬄ U+FB04 fc=ffl LATIN SMALL LIGATURE FFL
ﬂ U+FB02 fc=fl LATIN SMALL LIGATURE FL

ß U+00DF fc=ss LATIN SMALL LETTER SHARP S
ẞ U+1E9E fc=ss LATIN CAPITAL LETTER SHARP S
ﬅ U+FB05 fc=st LATIN SMALL LIGATURE LONG S T
ﬆ U+FB06 fc=st LATIN SMALL LIGATURE ST
ẗ U+1E97 fc=t○̈ LATIN SMALL LETTER T WITH DIAERESIS

There are lots of them in Greek though, because of the combinations
of those with multichar folds ending/starting in small iota.
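
For anyone wanting to reproduce that survey mechanically, here is a hedged Python sketch. It assumes `str.casefold` reflects the full case folding table of whatever Unicode version the interpreter ships, so the totals may differ from the 104 counted above; `multi_folds`, `overlaps` and `overlap_pairs` are names invented for this example.

```python
import sys

# Map each multi-character full case fold to the code points producing it.
multi_folds = {}
for cp in range(sys.maxunicode + 1):
    folded = chr(cp).casefold()
    if len(folded) > 1:
        multi_folds.setdefault(folded, []).append(cp)


def overlaps(first, second):
    """True if a proper suffix of `first` is also a proper prefix of `second`."""
    longest = min(len(first), len(second)) - 1
    return any(first[-k:] == second[:k] for k in range(1, longest + 1))


# Ordered pairs (a, b) where the tail of fold a can be the head of fold b.
overlap_pairs = {
    (a, b) for a in multi_folds for b in multi_folds if overlaps(a, b)
}

# Show only the pairs whose first fold is plain ASCII (the Latin cases).
for a, b in sorted(p for p in overlap_pairs if p[0].isascii()):
    print(ascii(a), ascii(b))
```

A fold can also overlap with itself under this definition (for example "ss" with "ss"), which is worth keeping in mind when reading the output.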

In the case of sst it would be something like
(?i:sst|\x{df}t|s\x{fb05}) plus whatever else the analysis for this
situation calls for, and the /i matching in the result is restricted
to single char folds, so e.g., \x{df} will match its capital, but not
expand out to 'ss' again.

I'm not sure I see how that would (or wouldn't) work. Why only singles?

Shouldn't /sst/i match /ss/i then /t/i or else /s/i then /st/i? That is,
/sst/i should be exactly the same, given the current folding tables, as:

(?x: (?: [\x{53}\x{73}\x{17F}]{2} | [\x{DF}\x{1E9E}] ) t
| [\x{53}\x{73}\x{17F}] [\x{FB05}\x{FB06}] )

And also as this, no matter the folding tables:

/ss/ig && /\Gt/i
||
/s/ig && /\Gst/i

Right?

0053 LATIN CAPITAL LETTER S (fc=s)
0073 LATIN SMALL LETTER S (fc=s)
017F LATIN SMALL LETTER LONG S (fc=s)

00DF LATIN SMALL LETTER SHARP S (fc=ss)
1E9E LATIN CAPITAL LETTER SHARP S (fc=ss)

FB05 LATIN SMALL LIGATURE LONG S T (fc=st)
FB06 LATIN SMALL LIGATURE ST (fc=st)

I guess it's a little more complicated than that, since /sst./i
should match any of these:

2 code points:
either sharp s, followed by a small t with diaeresis
3 code points:
any s, followed by either st, followed by any code point
3 code points:
two of any s, followed by a small t with diaeresis
4 code points:
two of any s, followed by either t, followed by any code point

If you're going to be rewriting that as alternatives, I wonder
about the ordering affecting choices. Hm.

This still won't handle the cases like (ss)(t), etc.

I'm not especially worried about the capture part. For that at least, and
maybe for the other cases as well, certainly fc($string) =~ /pattern/ would
be a lot easier than $string =~ /pattern/i, but you lose control over which
part is case insensitive. Alas.
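
The trade-off in that last paragraph can be seen in a few lines. This is a Python sketch in which `str.casefold` stands in for Perl's `fc`; the sample word and the variable names are assumptions chosen for illustration.

```python
import re

text = "Stra\u00DFe"  # "Strasse" spelled with LATIN SMALL LETTER SHARP S

# Fold the subject and the pattern text, then match case-sensitively.
folded_hit = re.search("strasse".casefold(), text.casefold()) is not None

# The cost: folding is all-or-nothing. A pattern that wants a literal
# capital "S" followed by case-insensitive "trasse" can no longer say so,
# because the folded subject never contains a capital letter.
mixed_hit = re.search("Strasse", text.casefold()) is not None

print(folded_hit, mixed_hit)
```

Folding the subject up front sidesteps the overlap question entirely, but only for patterns that are case-insensitive from end to end.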

--tom

p5pRT · 2011-09-14T04:42:36Z

From @khwilliamson

On 09/08/2011 07:16 PM, Tom Christiansen wrote:

"karl williamson via RT"<perlbug-followup@perl.org> wrote
on Wed, 07 Sep 2011 18:09:07 PDT:

It's not a bug in unicode_strings, as that is irrelevant here,
since the string is in utf8.

Hm, then how come -E gets a different answer than -e gets?
Perhaps I botched the tests.

It is indeed a bug in regcomp.c, and its implementation of unicode_strings.

It is a bug in regcomp.c; and it is from my
overlooking the situation specified in the bug report. I am thinking
about solutions; probably a static analysis done under control of regen
to look for cases where the tail of a multichar fold can be the head of
another, and then have regcomp look for those and substitute in an
appropriate pattern.

Yes, there are a few of those.

This is a red herring, as I tried to point out in an email followup to
the message you're quoting. It turns out that there are several
different bugs. I was wrong about the overlap, except for these cases:
GREEK_SMALL_LETTER_UPSILON_WITH_DIALYTIKA_AND_TONOS:
GREEK_SMALL_LETTER_IOTA_WITH_DIALYTIKA_AND_TONOS:
LATIN_SMALL_LETTER_SHARP_S:
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA

and their capitals, and what they fold to. The reason there is an
overlap problem stems from attempts, starting quite a few releases ago,
to get around a deficiency in the regex optimizer, which everyone is
scared to touch. I made some fixes in 5.14, but they broke other
things, as you guys have discovered.

Besides the latin ff/fi/ffl/etc ligatures, there are many Greek
code points whose multichar folds *end* with a small iota and a
few whose multichar folds *begin* with a small iota. So those
are overlaps.

I'm sure this is the same bug, but I see that not all
combinations of FB00 (ff) and FB01 (fi) work.

Actually the rest are not the same bug. They appear to be in regexec.c.
I don't know when the bug got introduced; likely it's been there all
along.

% perl -le 'print "\x{FB00}\x{FB01}" =~ /ff/i || 0'
1
% perl -le 'print "\x{FB01}\x{FB00}" =~ /ff/i || 0'
0
% perl -le 'print "\x{FB00}\x{FB01}" =~ /fi/i || 0'
0
% perl -le 'print "\x{FB01}\x{FB00}" =~ /fi/i || 0'
1

Although full ones do:

% perl -le 'print "\x{FB00}\x{FB01}" =~ /fffi/i || 0'
1

Even this tricky bit:

% perl -le 'print "\x{FB00}\x{FB01}" =~ /f\x{FB03}/i || 0'
1

That last one, where it needs the casefolds of both sides, is kinda
impressive: Ruby doesn't get that one right, although it does
get right others that Perl fails at:

% ruby -le 'print "\uFB00\uFB01" =~ /ff/i ? "Pass": "Fail"'
Pass
% ruby -le 'print "\uFB00\uFB01" =~ /fi/i ? "Pass": "Fail"'
Pass
% ruby -le 'print "\uFB00\uFB01" =~ /fffi/i ? "Pass": "Fail"'
Pass
% ruby -le 'print "\uFB00\uFB01" =~ /f\uFB03/i ? "Pass": "Fail"'
Fail

And similarly, we get this one:

% perl -le 'print "\x{FB00}i" =~ /f\x{FB01}/i ? "Pass" : "Fail"'
Pass

But Ruby doesn't:

% ruby -le 'print "\uFB00i" =~ /f\uFB01/i ? "Pass" : "Fail"'
Fail

That's kinda cool that we manage that. Good going, Karl!

I looked at all 104 multichar folds, and the only ones that look like they
overlap in Latin are these:
ﬀ U+FB00 fc=ff LATIN SMALL LIGATURE FF
ﬃ U+FB03 fc=ffi LATIN SMALL LIGATURE FFI
ﬁ U+FB01 fc=fi LATIN SMALL LIGATURE FI
İ U+0130 fc=i○̇ LATIN CAPITAL LETTER I WITH DOT ABOVE

ﬀ U+FB00 fc=ff LATIN SMALL LIGATURE FF
ﬄ U+FB04 fc=ffl LATIN SMALL LIGATURE FFL
ﬂ U+FB02 fc=fl LATIN SMALL LIGATURE FL

ß U+00DF fc=ss LATIN SMALL LETTER SHARP S
ẞ U+1E9E fc=ss LATIN CAPITAL LETTER SHARP S
ﬅ U+FB05 fc=st LATIN SMALL LIGATURE LONG S T
ﬆ U+FB06 fc=st LATIN SMALL LIGATURE ST
ẗ U+1E97 fc=t○̈ LATIN SMALL LETTER T WITH DIAERESIS
There are lots of them in Greek though, because of the combinations
of those with multichar folds ending/starting in small iota.

In the case of sst it would be something like
(?i:sst|\x{df}t|s\x{fb05}) plus whatever else the analysis for this
situation calls for, and the /i matching in the result is restricted
to single char folds, so e.g., \x{df} will match its capital, but not
expand out to 'ss' again.

I'm not sure I see how that would (or wouldn't) work. Why only singles?

Shouldn't /sst/i match /ss/i then /t/i or else /s/i then /st/i? That is,
/sst/i should be exactly the same, given the current folding tables, as:
(?x: (?: [\x{53}\x{73}\x{17F}]{2} | [\x{DF}\x{1E9E}] ) t
| [\x{53}\x{73}\x{17F}] [\x{FB05}\x{FB06}] )

And also as this, no matter the folding tables:

/ss/ig && /\Gt/i
||
/s/ig && /\Gst/i

Right?

I haven't examined this closely, since it's based on a false premise
about the cause of the problem.

0053 LATIN CAPITAL LETTER S (fc=s)
0073 LATIN SMALL LETTER S (fc=s)
017F LATIN SMALL LETTER LONG S (fc=s)

00DF LATIN SMALL LETTER SHARP S (fc=ss)
1E9E LATIN CAPITAL LETTER SHARP S (fc=ss)

FB05 LATIN SMALL LIGATURE LONG S T (fc=st)
FB06 LATIN SMALL LIGATURE ST (fc=st)

I guess it's a little more complicated than that, since /sst./i
should match any of these:

2 code points:
either sharp s, followed by a small t with diaeresis
3 code points:
any s, followed by either st, followed by any code point
3 code points:
two of any s, followed by a small t with diaeresis
4 code points:
two of any s, followed by either t, followed by any code point

If you're going to be rewriting that as alternatives, I wonder
about the ordering affecting choices. Hm.

I think the solution will be to avoid the whole overlap problem, and
address the optimizer issue more directly. Then this kind of stuff
doesn't come up. But that solution is painful, or else it would have
been done a long time ago, instead of the workarounds that turn out to
not solve it completely, much like Salzenberg's Law of Pretense.

This still won't handle the cases like (ss)(t), etc.

I'm not especially worried about the capture part. For that at least, and
maybe for the other cases as well, certainly fc($string) =~ /pattern/ would
be a lot easier than $string =~ /pattern/i, but you lose control over which
part is case insensitive. Alas.

It's not just the capture. The same problem occurs if there are
quantifiers, or non-capturing clusters, or character classes [a-z][st][tu].

It would be helpful if you would keep looking in your copious free time
:) for other cases like these, and even more helpful if you could frame
them in terms of a TODO patch.

p5pRT · 2011-10-14T02:47:23Z

From @khwilliamson

On 09/08/2011 07:16 PM, Tom Christiansen wrote:

I'm sure this is the same bug, but I see that not all
combinations of FB00 (ff) and FB01 (fi) work.

% perl -le 'print "\x{FB00}\x{FB01}" =~ /ff/i || 0'
1
% perl -le 'print "\x{FB01}\x{FB00}" =~ /ff/i || 0'
0
% perl -le 'print "\x{FB00}\x{FB01}" =~ /fi/i || 0'
0
% perl -le 'print "\x{FB01}\x{FB00}" =~ /fi/i || 0'
1

As I said in an earlier email, this is not the same bug as the ones
involving the sharp SS. And these are now fixed, leaving this
particular symptom (I hope) only for the 3 "tricky" fold characters and
their folds. Those require significantly more work than this trivial 1
line patch:

commit 7c1b9f3
Author: Karl Williamson <public@khwilliamson.com>
Date: Thu Oct 13 19:56:45 2011 -0600

regexec.c: Fix "\x{FB01}\x{FB00}" =~ /ff/i

Only the first character of the string was being checked when scanning
for the beginning position of the pattern match.

This was so wrong, it looks like it has to be a regression. I
experimented a little and did not find any. I believe (but am not
certain) that a multi-char fold has to be involved. The handling of
these was so broken before 5.14 that there very well may not be a
regression.

p5pRT · 2012-01-19T22:45:06Z

From @khwilliamson

The last bit of this was fixed in commit
bb91448
--
Karl Williamson

p5pRT · 2012-01-19T22:45:06Z

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Jan 19, 2012

p5pRT added Severity Low distro-openbsd labels Oct 19, 2019
