"the Unicode bug", reversed? #11635
From [email protected]

Summary: If you use -E, matches fail that work fine under -e.

This is Matthew Barnett, who is implementing full casefolding in Python,

However, these match:

    "\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i

but these don't match:

    "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i

I think what might be happening is that it isn't handling the

When it sees "sst" in the regex it identifies "ss" as a possible result

    ss => ss|\N{LATIN SMALL LETTER SHARP S}

but it then doesn't identify "st" as another possible result of full

    st => st|\N{LATIN SMALL LIGATURE ST}

It should be doing:

    sst => sst|\N{LATIN SMALL LETTER SHARP S}t|s\N{LATIN SMALL LIGATURE ST}

(Again, I'm ignoring the other alternative.)

And it is indeed true that those two test cases fail, under both 5.14 and blead:

    This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level
    This is perl 5, version 15, subversion 2 (v5.15.2-264-g87e4a53) built for darwin-2level

As shown here:

    % perl -Mcharnames=:full -lE 'print "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'

However, merely change the -E to a -e, suddenly they work!

    % perl -Mcharnames=:full -le 'print "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'

So it looks like this is some reverse Unicode bug. Very strange.

For the record, Ruby does get these right:

    % ruby 'print "s\uFB05" =~ /sst/i ? "Pass" : "Fail"'

Where that is:

    % ruby -v

Here are other, probably related issues:

    % perl -lE 'print "\x{FB05}" =~ /st/i ? "Pass" : "Fail"'

However, unlike the early attempts, *those* do *not* suddenly pass if

    % perl -le 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'

See; it still fails. Very strange. They work fine in Ruby:

    % ruby -le 'print "\uFB05" =~ /st/i ? "Pass" : "Fail"'

Like Perl, Ruby does *not* do partial matches of full casefolds

    % perl -lE 'print "\x{DF}\x{FB05}" =~ /ssst/i ? "Pass" : "Fail"'
    % perl -lE 'print "\x{DF}\x{FB05}" =~ /sst/i ? "Pass" : "Fail"'

Which is as expected. The others aren't.
--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

Characteristics of this binary (from libperl):
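A minimal sketch of the alternation the report proposes for "sst", written in Python rather than Perl. The `FOLDS` table and the `alternatives` helper are hypothetical names introduced for illustration. The table covers only the sharp s and the two st ligatures (no capitals, no long s), and this is not a description of how regcomp.c works. It lists both st ligatures, so it yields one more alternative than the three written out in the report.

```python
# Hypothetical table: folded sequence -> single characters whose full
# casefold is that sequence. Deliberately a tiny subset for illustration.
FOLDS = {
    "ss": ["\u00DF"],            # LATIN SMALL LETTER SHARP S
    "st": ["\uFB06", "\uFB05"],  # LIGATURE ST, LIGATURE LONG S T
}


def alternatives(literal):
    """Return every spelling of `literal` obtained by replacing
    non-overlapping multi-character folds with one character."""
    if not literal:
        return {""}
    results = set()
    # Option 1: keep the first character and expand the remainder.
    for rest in alternatives(literal[1:]):
        results.add(literal[0] + rest)
    # Option 2: consume a multi-character fold at this position.
    for folded, singles in FOLDS.items():
        if literal.startswith(folded):
            for rest in alternatives(literal[len(folded):]):
                for ch in singles:
                    results.add(ch + rest)
    return results


if __name__ == "__main__":
    for alt in sorted(alternatives("sst")):
        # Every alternative should casefold back to the original literal.
        print(ascii(alt), alt.casefold() == "sst")
```

Because the recursion tries a fold at every position, "sst" is expanded both as "ss" followed by "t" and as "s" followed by "st", which is the step the report suspects is being missed.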
From @rgarcia

On 6 September 2011 17:29, tchrist1 wrote:

That would be because -E turns on all current features, including

Now, I would let Karl comment on whether this is a bug in
The RT System itself - Status changed from 'new' to 'open'
From [email protected]

"Rafael Garcia-Suarez via RT" <perlbug-followup@perl.org> wrote

Yes, I realize that. Normally, "The Unicode Bug" is cleared up by

I can't really see how it wouldn't be.

--tom
From @khwilliamson

On 09/07/2011 09:30 AM, Tom Christiansen wrote:

It's not a bug in unicode_strings, as that is irrelevant here, since the

This still won't handle the cases like (ss)(t), etc.
From @chipdude

On 9/7/2011 6:07 PM, Karl Williamson wrote:

I've formulated a law that covers such things.

http://chip.typepad.com/weblog/2011/09/salzenbergs-law-of-pretense.html

Related: "Do not lie to thy compiler, for it shall get its revenge." -
From @khwilliamson

On 09/07/2011 07:07 PM, Karl Williamson wrote:

I should do more baking before I publicize my partially-baked ideas. I
From [email protected]

"karl williamson via RT" <perlbug-followup@perl.org> wrote
Hm, then how come -E gets a different answer than -e gets?
Yes, there are a few of those. Besides the Latin ff/fi/ffl/etc ligatures, there are many Greek

I'm sure this is the same bug, but I see that not all

    % perl -le 'print "\x{FB00}\x{FB01}" =~ /ff/i || 0'

Although full ones do:

    % perl -le 'print "\x{FB00}\x{FB01}" =~ /fffi/i || 0'

Even this tricky bit:

    % perl -le 'print "\x{FB00}\x{FB01}" =~ /f\x{FB03}/i || 0'

That last where it needs the casefolds of both sides is kinda

    % ruby -le 'print "\uFB00\uFB01" =~ /ff/i ? "Pass": "Fail"'

And similarly, we get this one:

    % perl -le 'print "\x{FB00}i" =~ /f\x{FB01}/i ? "Pass" : "Fail"'

But Ruby doesn't:

    % ruby -le 'print "\uFB00i" =~ /f\uFB01/i ? "Pass" : "Fail"'

That's kinda cool that we manage that. Good going, Karl!

I looked at all 104 multichar folds, and the only ones that look like they

    ff  U+FB00  fc=ff  LATIN SMALL LIGATURE FF
    ff  U+FB00  fc=ff  LATIN SMALL LIGATURE FF
    ß   U+00DF  fc=ss  LATIN SMALL LETTER SHARP S

There are lots of them in Greek though, because of the combinations
I'm not sure I see how that would (or wouldn't) work. Why only singles? Shouldn't /sst/i match /ss/i then /t/i or else /s/i then /st/i? That is,

    (?x: (?: [\x{53}\x{73}\x{173}]{2} | [\x{DF}\x{1E93}] ) t

And also as this, no matter the folding tables:

    /ss/ig && /\Gt/i

Right?

    0053 LATIN CAPITAL LETTER S (fc=s)
    00DF LATIN SMALL LETTER SHARP S (fc=ss)
    FB05 LATIN SMALL LIGATURE LONG S T (fc=st)

I guess it's a little more complicated than that, since /sst./i

2 code points:

If you're going to be rewriting that as alternatives, I wonder

I'm not especially worried about the capture part. For that at least, and

--tom
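One way to picture the matching rule discussed in this thread is to fold both sides fully and accept a match only when it starts and ends on a character boundary of the original string, so that a match may not use just part of one character's fold. The sketch below is in Python and relies on `str.casefold()`, which performs full case folding. The function `fold_match_spans` is a hypothetical helper for literal patterns only, not Perl's or Ruby's algorithm.

```python
def fold_match_spans(text, pattern):
    """Return (start, end) index pairs into `text` where the full
    casefold of `pattern` matches on original character boundaries."""
    folded_pattern = pattern.casefold()
    folded_text = ""
    boundary = {}  # offset in folded_text -> index in text
    for index, ch in enumerate(text):
        boundary[len(folded_text)] = index
        folded_text += ch.casefold()
    boundary[len(folded_text)] = len(text)

    spans = []
    pos = folded_text.find(folded_pattern)
    while pos != -1:
        end = pos + len(folded_pattern)
        # Reject matches that begin or end inside one character's fold.
        if pos in boundary and end in boundary:
            spans.append((boundary[pos], boundary[end]))
        pos = folded_text.find(folded_pattern, pos + 1)
    return spans


if __name__ == "__main__":
    print(fold_match_spans("s\uFB05", "sst"))           # s + LONG S T ligature
    print(fold_match_spans("\u00DF\uFB05", "ssst"))     # sharp s + ligature
    print(fold_match_spans("\u00DF\uFB05", "sst"))      # would split sharp s
    print(fold_match_spans("\uFB00i", "f\uFB01"))       # folds on both sides
```

Under this rule "\x{DF}\x{FB05}" matches "ssst" and "st" but not "sst", since the latter would have to begin in the middle of the sharp s fold. Folding the pattern as well as the text also covers the case where both sides contain ligatures.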
From @khwilliamson

On 09/08/2011 07:16 PM, Tom Christiansen wrote:
It is indeed a bug in regcomp.c, and its implementation of unicode_strings.
This is a red herring, as I tried to point out in an email followup to and their capitals, and what they fold to. The reason there is is an
Actually the rest are not the same bug. They appear to be in regexec.c.
I haven't examined this closely, since it's based on a false premise
I think the solution will be to avoid the whole overlap problem, and
It's not just the capture. The same problem occurs if there are

It would be helpful if you would keep looking in your copious free time
From @khwilliamson

On 09/08/2011 07:16 PM, Tom Christiansen wrote:

As I said in an earlier email, this is not the same bug as the ones

commit 7c1b9f3

    regexec.c: Fix "\x{FB01}\x{FB00}" =~ /ff/i

    Only the first character of the string was being checked when scanning

This was so wrong, it looks like it has to be a regression. I
From @khwilliamson

The last bit of this was fixed in commit
@khwilliamson - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#98546 (status was 'resolved')
Searchable as RT98546$