quotemeta() fails to quote literal non-word character under utf8 #10602
From [email protected]
Created by [email protected]
quotemeta() fails to quote a CENT SIGN when,
----
use utf8;
# Bug Synopsis
# quotemeta() fails to quote a CENT SIGN when,
ok("¢","\xA2"); # ok
ok(quotemeta("¢"),"\\¢"); # NOT OK
# Bug Demonstration
my $a = "¢";
# Additional notes
# CENT SIGN is \xA2
----
1..21
Perl Info
|
From @iabyn
On Thu, Sep 02, 2010 at 12:58:16PM -0700, Mitchell N Charity wrote:
This appears to be down to a difference in behaviour of quotemeta For non-utf8 strings, all chars *except* isALNUM() are \\-escaped; in For utf8 strings, chars with ord > 127 are never quoted. I think this The current docs make it clear that all chars except [A-Za-z_0-9] should -- |
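The representation dependence described above can be observed by quoting the same character in both internal forms. This is a minimal sketch, assuming no `unicode_strings` feature is in scope; the variable names are illustrative. The result for the UTF-8 copy is deliberately printed rather than asserted, because it depends on the perl version (this ticket concerns the behaviour before the fix).

```perl
use strict;
use warnings;

# The same character (U+00A2 CENT SIGN) in two internal representations.
my $bytes = "\xA2";        # not UTF-8 encoded internally
my $utf8  = "\xA2";
utf8::upgrade($utf8);      # same character, now UTF-8 encoded internally

my $q_bytes = quotemeta $bytes;
my $q_utf8  = quotemeta $utf8;

# The two strings compare equal, yet quotemeta() may treat them differently.
# Outside unicode_strings, a non-UTF-8 string has all its non-ASCII
# characters quoted, so $q_bytes is a backslash followed by the cent sign.
printf "strings equal:           %s\n", ($bytes eq $utf8 ? "yes" : "no");
printf "non-UTF-8 quoted length: %d\n", length $q_bytes;
printf "UTF-8 quoted length:     %d\n", length $q_utf8;
```

A length of 1 for the UTF-8 copy would correspond to the behaviour reported in this ticket.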
The RT System itself - Status changed from 'new' to 'open' |
From [email protected]
I believe Unicode makes some guarantees regarding the stability of the --tom |
From [email protected]
I've been thinking about this a bit more, rereading UAX#31, UTS#18, and I *think* that is what Dave is suggesting, not that merely all [^\x00-\x7F] But I have encountered a problem with that idea. Unicode defines certain The important bits are from: http://unicode.org/reports/tr31/#Pattern_Syntax As of Unicode4.1, two Unicode character properties are defined to For stability, the values of these properties are absolutely invariant, When *generating* rules or patterns, all whitespace and syntax code There's more there, which should probably be studied before we One would think that backslashing all \p{Pattern_Syntax} characters % unichars -c '\p{Pattern_Syntax}' '\w' I don't know whether that is a mistake or not. Karl? There are also two code points that are Pattern_White_Space but not % unichars -c '\p{Pattern_White_Space}' '\P{White_Space}' Which I'm not sure what to make of. For what it's worth (which probably is nothing), there are 63 I believe there to be no changes to the sets of things I've --tom ELABORATION: The reason Perl quotes all \W characters is because of first principles That principle is that, in patterns: * a \w character never means anything special Whence it follows that * backslashing a \w character might mean something special In point of fact, there are uniquely 12 and 12 only metacharacters \ | ( ) [ { ^ $ * + ? . The question becomes whether we want the flexibility to someday extend We've never drawn upon the \W reservoir for other pattern matching <:Letter> # \pL <:East_Asian_Width<Narrow>> # \p{EA=N} For another thing, Perl6 uses "~" for matching nested subrules and uses I do not know whether one can add new metacharacters in Perl6 patterns. I have proposed that we adopt a way to specify character class union, sub IsKana { which was used back before we had a proper Kana property (we now do). Even if we did something Java's character class set mechanics (as I The Unicode documents use a cleaner syntax than Java's for talking |
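The first-principles argument above (an unbackslashed \w character is never special, while a backslashed one may be) can be checked against the twelve metacharacters listed. A minimal sketch with illustrative variable names, not taken from the thread:

```perl
use strict;
use warnings;

# The twelve regex metacharacters listed above.
my @meta = ('\\', '|', '(', ')', '[', '{', '^', '$', '*', '+', '?', '.');

my $all_literal = 1;
for my $char (@meta) {
    my $quoted = quotemeta $char;    # a backslash followed by the character
    # The quoted form must be exactly "\<char>" and must match the
    # character itself as a literal.
    $all_literal = 0
        unless $quoted eq "\\$char" && $char =~ /\A$quoted\z/;
}

# Backslashing a \w character, by contrast, may mean something special:
# \d is a digit class, not a literal "d".
my $d_matches_digit  = ('7' =~ /\A\d\z/) ? 1 : 0;
my $d_matches_letter = ('d' =~ /\A\d\z/) ? 1 : 0;

print "all twelve quoted literally: $all_literal\n";
print "\\d matches '7': $d_matches_digit, matches 'd': $d_matches_letter\n";
```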
From @khwilliamson
Tom Christiansen wrote:
I believe that UTR18 and UTS18 are now the same document. My first thought was that for Unicode, that all \W characters
Note that some of the code points in the sets are still unassigned, so
I have emailed Unicode about this apparent discrepancy.
They are, however, default ignorable code points, so it is recommended
So I don't know what to do. This may be complicated by the fact that The Unicode recommendation is to only quote the pattern white space and I believe Tom has a better handle on the implications than me. I await |
From [email protected]
SUMMARY: I believe that if nothing substantial can be gained by I also think we should use those suggestions if there were some Karl wrote:
Good, thank you.
Could you please explain what that means, that Unicode botched the My working definitions of an alpha and an identifier charclass alphabetic_charclass = identifier_charclass = Now, that's not quite the way #31's section 2 reads, but it may be What part of the sense of "alpha" or "identifier" did Perl and Unicode
That would save on space.
Modulo the problematic U+2E2F, I believe that quoting all \W characters Pattern_Syntax Let's for this discussion call those the Pattern_Quotable set, or PQ. The considerations are time and space. On space, there are certainly more % unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l % unichars '\W' | wc -l Adding Unassigned, PrivateUse, Han, and InHangulSyllables produces % unichars -u '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l And of course a substantial gain on the \W set: % unichars -u '\W' | wc -l I am somewhat surprised to see more identifier characters % unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' '\w'
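For readers without the `unichars` tool, the size comparison between the PQ set and \W can be approximated in plain Perl. This is only a rough sketch: it is restricted to the Basic Multilingual Plane (an assumption made here to keep it quick, so the figures will not match the full-range `unichars` counts), the variable names are illustrative, and the exact numbers depend on the Unicode version shipped with the perl in use.

```perl
use strict;
use warnings;
no warnings 'utf8';   # stay quiet about noncharacters such as U+FFFE

# The candidate "Pattern_Quotable" (PQ) set from the discussion above.
my $pq_re = qr/[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]/;

my ($pq_count, $nonword_count) = (0, 0);
for my $cp (0 .. 0xFFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
    my $char = chr $cp;
    utf8::upgrade($char);                     # force Unicode matching rules
    $pq_count++      if $char =~ $pq_re;
    $nonword_count++ if $char =~ /\W/;
}

print "BMP code points in PQ: $pq_count\n";
print "BMP code points in \\W: $nonword_count\n";
```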
Maybe, maybe not.
It would save us on space to make quotemeta working on the smaller PQ set I mean apart from the obvious that it takes time to allocate more stuff I don't really know much how the swatches work, nor the true costs of If there is nothing substantial to be gained by using the broader \W over --tom |
From @khwilliamson
Tom Christiansen wrote:
I've gotten a (rapid) preliminary response. Their definition of \w
Here are my comments in mktables, added when I researched the problem: And here are the comments from handy.h:
See the comments above. Perl doesn't use IDStart at all. Instead it That definition has only been in place for some of the 5.13.X releases. This caused the parser to loop on some inputs. The details are in the
On space, there are certainly more
There's something wrong if this includes only the first 16 variation
I think the differences in time/space are in the noise. swashes aren't
|
From @ikegami
On Thu, Dec 16, 2010 at 2:47 PM, Tom Christiansen <tchrist@perl.com> wrote:
"-" and "^" are meta in certain positions. /[^a]/ vs /[\^a]/ |
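The position-dependent behaviour of "^" and "-" inside a bracketed character class can be sketched as follows. The variable names are illustrative; the last check shows that text passed through quotemeta() stays literal when interpolated into a class.

```perl
use strict;
use warnings;

# A leading "^" negates the class; a backslashed "^" is a literal caret.
my $negated_matches_b     = ('b' =~ /[^a]/)  ? 1 : 0;
my $literal_matches_b     = ('b' =~ /[\^a]/) ? 1 : 0;
my $literal_matches_caret = ('^' =~ /[\^a]/) ? 1 : 0;

# "-" between two characters forms a range; backslashed, it is a hyphen.
my $range_matches_b  = ('b' =~ /[a-c]/)  ? 1 : 0;
my $hyphen_matches_b = ('b' =~ /[a\-c]/) ? 1 : 0;

# quotemeta() backslashes "^", so interpolating it keeps it literal.
my $caret            = quotemeta '^';
my $quoted_matches_b = ('b' =~ /[${caret}a]/) ? 1 : 0;

print "[^a] vs b: $negated_matches_b, [\\^a] vs b: $literal_matches_b\n";
print "[a-c] vs b: $range_matches_b, [a\\-c] vs b: $hyphen_matches_b\n";
```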
From [email protected]
I elsewhere wrote that charclasses operate under different rules. --tom |
From @Abigail
On Thu, Dec 16, 2010 at 12:47:21PM -0700, Tom Christiansen wrote:
I've always wondered why a lone } or ] does not need escaping (they're
And I don't think Perl5 ever will. There's so much code out there that Abigail |
From [email protected]
So have I. It could be worse: things like quantifiers still
Now that you mention it, you're right, we do. Hadn't thought of that. --tom |
From @iabyn
On Fri, Dec 17, 2010 at 08:11:15AM -0700, Tom Christiansen wrote:
Ok. How about the following resolution: we change it so that utf8 strings -- |
From @khwilliamson
On 12/29/2010 03:57 AM, Dave Mitchell wrote:
This proposal and all others died in 5.14 for lack of consensus. This I'm thinking we should just do what the original trouble ticket asks I'm reopening this publicly now, in order to try to get resolution in If we do this, does that close the door on later changing to use the |
From @demerphq
On 29 December 2010 11:57, Dave Mitchell <davem@iabyn.com> wrote:
I think it depends on what we want to do. If quotemeta() is intended Some of the options are: 1) make quotemeta() *not* escape codepoints>127 regardless In terms of back-compat your suggestion (2) or my suggestion (1) are BUT, option 3 has some things to be said for it. Specifically, its The efficiency point is also why I think that escaping codepoints we Also, as an aside to the cc list: I do not think that what Unicode cheers, -- |
From @demerphq
On 6 February 2012 14:41, demerphq <demerphq@gmail.com> wrote:
To be clear I meant codepoints where: 127 < codepoint < 256 Sorry for the extra mail... Yves |
From [email protected]
Disagree, several times over. First of all, the Pattern_Syntax Unicode character property is *not* Rather, it is defined in UAX#44, the Unicode Character Database. That Lastly, both those properties are, like the names of the characters Now please stop repeating this nonsense about Unicode being a moving As for casefolding, we have *not* "wasted a lot of time". But I am not --tom |
From @demerphq
On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:
Not sure how this is relevant.
Given that we reserve the right to add new regex meta characters if we
I have personal experience with Unicode being a moving target. For
It is entirely possible I am misinformed, but this is my impression of Anyway, as and when you have time I would like to hear more of your cheers, -- |
From @nwc10
On Mon, Feb 06, 2012 at 05:13:47PM +0100, demerphq wrote:
Aspects of Unicode aren't fixed yet. Being on the bleeding edge of "ß" =~ /ss/i um, has "issues" about what exactly $1 and $2 should be for the capturing "s" =~ /^[^ß]/
In the passing mailing list traffic I didn't spot anything that made me Mainly it's that a lot of what they define is fundamentally *hard* to Nicholas Clark |
From @khwilliamson
On 02/06/2012 10:03 AM, Nicholas Clark wrote:
I believe it's decisions they haven't finalized yet. Indications are If they do back away, then perhaps we will have made wasted effort. |
From @ikegami
On Mon, Feb 6, 2012 at 8:41 AM, demerphq <demerphq@gmail.com> wrote:
That's not very safe. It prevents storing the escaped pattern and using it
Also mentioned was: 4) make quotemeta() escape some code-points above 127. (\W, Analysis: (worst-to-best) (3) is the least forward-compatible. (3) is the least backward-compatible (e.g. it would no longer escape "&"). (3) is the most dangerous, affecting characters below 127 (e.g. some might (3) is faster than (1), (2) and (4) if you think the time spent parsing "\" - Eric |
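The backward-compatibility point about "&" reflects how quotemeta() treats ASCII input today: every ASCII non-word character is backslashed, whether or not it is currently a metacharacter. The sketch below uses a hypothetical input string and illustrative variable names; it also shows escaped text being kept and interpolated into a larger pattern later, which is one reading of the storage concern raised above.

```perl
use strict;
use warnings;

my $user_input = 'R&D (1+1)';             # hypothetical untrusted text
my $escaped    = quotemeta $user_input;   # may be stored and reused later

# Every ASCII non-word character is backslashed, including "&" and the
# space, even though neither is a regex metacharacter by default.
print "$escaped\n";

# Later, the stored text is interpolated into a larger pattern and still
# matches only the literal input.
my $matches_literal = ('see R&D (1+1) notes' =~ /see $escaped notes/) ? 1 : 0;
my $matches_other   = ('see R&D 11 notes'    =~ /see $escaped notes/) ? 1 : 0;

print "literal text matched: $matches_literal\n";
print "different text matched: $matches_other\n";
```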
From @khwilliamson
On 02/06/2012 01:19 PM, Eric Brine wrote:
Thanks for the analysis. I'd like to throw this comment in from this
|
From @ikegami
On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:
|
From @demerphq
On 6 February 2012 21:19, Eric Brine <ikegami@adaelis.com> wrote:
I do not have stats to back me up, but knowing how the code handles Yves -- |
From @demerphq
On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:
Devil's advocate: :-) Yves -- |
From @Abigail
On Tue, Feb 07, 2012 at 02:22:34AM +0100, demerphq wrote:
And break an ancient promise? [1] ;-) Unlike some other regular expression languages, there are no backslashed This is in the current manual page, but the exact same phrasing already [1] 22 years counts as ancient. Abigail |
From @demerphq
On 7 February 2012 03:06, Abigail <abigail@abigail.be> wrote:
Yes right, my bad. Did not think my post through before I sent it. Yves -- |
From @khwilliamson
On 02/06/2012 05:34 PM, Eric Brine wrote:
I've looked over this thread now several times and re-read Unicode's UAX It essentially suggests that characters that are \p{Pattern_Syntax} are Unicode also defines a few characters (also attached, and also UAX 31 also suggests that for readability all other white space (6.1 If Perl is willing to never use other than a pattern syntax character as Another reasonable basis is to use \W, which Tom has pointed earlier in Thus, I'm coming down to Tom's conclusion that if we do quoting based on People have talked about the speed of parsing quoted characters. But I have now formulated the following proposal: Non-utf8 string, not feature unicode_strings: Otherwise, It may be that we decide we will never use anything outside the dozen we This solution is completely backwards compatible in the ASCII range. The solution isn't backwards compatible above Latin1; nothing we do is, |
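The split in this proposal between non-UTF-8 strings outside `unicode_strings` and everything else can be sketched as below. It assumes perl 5.16 or later, where (per the documentation patch later in this thread) the behaviour landed; the variable names are illustrative.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # U+00E9, stored as a non-UTF-8 string

# Outside unicode_strings: every upper-Latin1 character is quoted.
my $legacy = quotemeta $e_acute;

# Inside unicode_strings: Unicode rules apply, and U+00E9 is a word
# character, so it is left alone (assumes perl 5.16 or later).
my $unicode = do {
    use feature 'unicode_strings';
    quotemeta $e_acute;
};

printf "without unicode_strings: %d char(s)\n", length $legacy;
printf "with unicode_strings:    %d char(s)\n", length $unicode;
```

The feature is lexically scoped, so the same string is quoted differently depending only on where the quotemeta() call is compiled.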
From @khwilliamson |
From @khwilliamson
# !!!!!!! DO NOT EDIT THIS FILE !!!!!!! # !!!!!!! INTERNAL PERL USE ONLY !!!!!!! # Use Unicode::UCD::prop_invlist() to access the contents of this file. return <<'END' =~ s/\s*#.*//mgr; |
From @khwilliamson
# !!!!!!! DO NOT EDIT THIS FILE !!!!!!! # !!!!!!! INTERNAL PERL USE ONLY !!!!!!! # Use Unicode::UCD::prop_invlist() to access the contents of this file. return <<'END' =~ s/\s*#.*//mgr; |
From @khwilliamson
# !!!!!!! DO NOT EDIT THIS FILE !!!!!!! # !!!!!!! INTERNAL PERL USE ONLY !!!!!!! # Use Unicode::UCD::prop_invlist() to access the contents of this file. return <<'END' =~ s/\s*#.*//mgr; |
From [email protected]
I never thought to check unassigned code points for properties. No room in PatWS, but LRM and RLM are \S. (Well, so is \cK, but that's only because we haven't fixed that yet to make Not sure what all the unassigned DI code points up in E0080–E00FF --tom |
From @nwc10
On Tue, Feb 07, 2012 at 12:22:30PM -0700, Karl Williamson wrote:
I don't see the approach of "yet another feature" as scaling. We'd likely as
"backwards" compatible or "bugwards" compatible? I'm finding it hard to Nicholas Clark |
From @khwilliamson
On 02/08/2012 04:36 AM, Nicholas Clark wrote:
I was hoping that would be people's sentiment about this. :)
Totally agree. So yet another option is to just fix the Unicode bug portion of this for We could use unicode_strings as a flag for the upper Latin1 range If it is on, we treat them as we've always treated above-Latin1 range Thus the only inconsistency is between non-unicode_strings and |
From @khwilliamson
On 02/08/2012 10:23 AM, Karl Williamson wrote:
If we go the pattern syntax route, I think we should quote the controls |
From @khwilliamson
I have mostly implemented what I last proposed, but attached is a doc I'm also thinking that under locale, quotemeta should just quote \W for |
From @khwilliamson
0002-temp-for-comment.patch
From 1607ec47dcb28ecd2687333d4f0d759eb0479312 Mon Sep 17 00:00:00 2001
From: Karl Williamson <[email protected]>
Date: Sun, 12 Feb 2012 09:41:25 -0700
Subject: [PATCH 2/2] temp for comment
---
pod/perlfunc.pod | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 591fa0d..ad8b7b5 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -4953,8 +4953,52 @@ input from the user, quotemeta() or C<\Q> must be used.
In Perl v5.14, all non-ASCII characters are quoted in non-UTF-8-encoded
strings, but not quoted in UTF-8 strings.
-It is planned to change this behavior in v5.16, but the exact rules
-haven't been determined yet.
+
+Starting in Perl v5.16, Perl adopted a Unicode-defined strategy
+for quoting non-ASCII characters; the quoting of ASCII characters is
+unchanged.
+
+Also unchanged is the quoting for non-UTF-8 strings when outside the
+scope of a C<use feature 'unicode_strings'>, which is to quote all
+characters in the upper Latin1 range. This provides complete backwards
+compatibility for old programs which do not use Unicode (but note that
+C<unicode_strings> is automatically enabled within the scope of a
+S<C<use v5.12>> or greater).
+
+Otherwise, Perl quotes non-ASCII characters using an adaptation from
+Unicode (see L<http://www.unicode.org/reports/tr31/>.)
+The only code points that are quoted are those that have any of the
+Unicode properties Pattern_Syntax, Pattern_White_Space, White_Space,
+Default_Ignorable_Code_Point, or General_Category=Control.
+
+Of these properties, the two important ones are Pattern_Syntax and
+Pattern_White_Space. They have been set up by Unicode for exactly this
+purpose of deciding which characters in a regular expression pattern
+should be quoted. No character that can be in an identifier has these
+properties.
+
+Perl promises, that if we ever add regular expression pattern
+metacharacters to the dozen already defined
+(C<\ E<verbar> ( ) [ { ^ $ * + ? .>), that we will only use ones that have the
+Pattern_Syntax property. Perl also promises, that if we ever add
+characters that are considered to be white space in regular expressions
+(currently mostly affected by C</x>), they will all have the
+Pattern_White_Space property.
+
+Unicode promises that the set of code points that have these two
+properties will never change, so something that is not quoted in v5.16
+will never need to be quoted in any future Perl release. (Not all the
+code points that match Pattern_Syntax have actually had characters
+assigned to them; so there is room to grow, but they are quoted
+whether assigned or not. Perl, of course, would never use an
+unassigned code point as an actual metacharacter.)
+
+Quoting characters that have the other 3 properties is done to enhance
+the readability of the regular expression and not because they actually
+need to be quoted (characters with the White_Space property are likely
+to be indistinguishable on the page or screen from those with the
+Pattern_White_Space property; and the other two properties contain
+non-printing characters).
=item rand EXPR
X<rand> X<random>
--
1.7.7.1
|
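The rule in the documentation patch above can be spot-checked by comparing a regex built from the five named properties against what quotemeta() actually does. This is a sketch only: it assumes perl 5.16 or later, the sample code points are chosen here for illustration and are not from the thread, and a handful of samples is not a proof of equivalence.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# The five properties named in the documentation patch above.
my $quotable = qr/[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{White_Space}\p{Default_Ignorable_Code_Point}\p{General_Category=Control}]/;

# Illustrative sample of non-ASCII code points.
my %sample = (
    'CENT SIGN'                       => 0x00A2,
    'LATIN SMALL LETTER E WITH ACUTE' => 0x00E9,
    'ZERO WIDTH SPACE'                => 0x200B,
    'LINE SEPARATOR'                  => 0x2028,
    'LEFTWARDS ARROW'                 => 0x2190,
    'CJK UNIFIED IDEOGRAPH-4E2D'      => 0x4E2D,
);

my $mismatches = 0;
for my $name (sort keys %sample) {
    my $char      = chr $sample{$name};
    my $predicted = ($char =~ $quotable)          ? 1 : 0;
    my $actual    = (quotemeta($char) ne $char)   ? 1 : 0;
    $mismatches++ if $predicted != $actual;
    printf "U+%04X %-32s predicted=%d actual=%d\n",
        $sample{$name}, $name, $predicted, $actual;
}
print "mismatches: $mismatches\n";
```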
From @rjbs
* Karl Williamson <public@khwilliamson.com> [2012-02-12T11:47:28]
...and I see that all characters that are ASCII and Pattern_Syntax are already Cool. -- |
From @khwilliamson
Now fixed by commit 2e2b257 |
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#77654 (status was 'resolved')
Searchable as RT77654$