/(?i:...)/ loses passed in charset #11967

Closed
p5pRT opened this issue Feb 20, 2012 · 12 comments
Comments

@p5pRT

p5pRT commented Feb 20, 2012

Migrated from rt.perl.org#111174 (status was 'resolved')

Searchable as RT111174$

@p5pRT

p5pRT commented Feb 20, 2012

From @khwilliamson

This is a bug report for perl from khw@karl.(none),
generated with the help of perlbug 1.39 running under perl 5.15.7.


Setting flags in a regular expression using the (?foo:...) notation
loses any passed-in character set, for example one set through the -E flag.
So,
  perl -E ' "\xe0" =~ /(?i:\w)/'
fails because the ?i: destroys the memory that the -E was used, which
should have forced Unicode semantics on the Latin1 character \xe0.

Spotted by Yves Orton
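
To make the reported failure concrete, here is a toy C model (not perl's implementation) of how \w classifies a single code point in the range 0x00-0xFF under /d, /u and /a rules. All names in it (word_rules, ascii_word, latin1_word, matches_w) are hypothetical, and the /d rule is simplified so that it depends only on whether the target string is stored as UTF-8.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: a toy model, not perl's internal API. */
typedef enum { RULES_DEPENDS, RULES_UNICODE, RULES_ASCII } word_rules; /* /d, /u, /a */

/* ASCII definition of \w: [A-Za-z0-9_]. */
static bool ascii_word(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

/* Unicode definition of \w, restricted to code points 0x00-0xFF. */
static bool latin1_word(unsigned char c)
{
    if (ascii_word(c))
        return true;
    if (c == 0xAA || c == 0xB5 || c == 0xBA)   /* feminine ordinal, micro sign, masculine ordinal */
        return true;
    /* Latin-1 letters, excluding the multiplication and division signs. */
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

/* Does code point c match \w?  Under /d the answer depends on how the
 * target string is stored (simplified here to one boolean). */
static bool matches_w(unsigned char c, word_rules rules, bool target_is_utf8)
{
    switch (rules) {
    case RULES_UNICODE: return latin1_word(c);
    case RULES_ASCII:   return ascii_word(c);
    case RULES_DEPENDS: return target_is_utf8 ? latin1_word(c) : ascii_word(c);
    }
    assert(!"unknown rules");
    return false;
}
```

In this model matches_w(0xE0, RULES_DEPENDS, false) is false while matches_w(0xE0, RULES_UNICODE, false) is true, which is the difference the report describes: the group was compiled with /d rules although Unicode rules had been requested.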



Flags:
  category=core
  severity=medium


Site configuration information for perl 5.15.7:

Configured by khw at Mon Feb 20 07:52:27 MST 2012.

Summary of my perl5 (revision 5 version 15 subversion 7) configuration:
  Commit id: 2703178
  Platform:
  osname=linux, osvers=2.6.35-32-generic-pae,
archname=i686-linux-thread-multi-64int-ld
  uname='linux karl 2.6.35-32-generic-pae #65-ubuntu smp tue jan 24
14:06:16 utc 2012 i686 gnulinux '
  config_args='-des -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-ggdb3' -A'optimize=-ggdb3' -A'optimize=-O0' -Dman1dir=none
-Dman3dir=none -DDEBUGGING -Dcc=g++ -Dusemorebits -Dusethreads'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=define
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='g++', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O0 -ggdb3',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.4.5', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='long double', nvsize=12,
Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='g++', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib
/usr/lib/i686-linux-gnu
  libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=/lib/../lib/libc.so.6, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.12'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -ggdb3 -ggdb3 -O0
-L/usr/local/lib -fstack-protector'

Locally applied patches:


@INC for perl 5.15.7:
  /home/khw/perl/blead/lib

/home/khw/blead/lib/perl5/site_perl/5.15.7/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/site_perl/5.15.7
  /home/khw/blead/lib/perl5/5.15.7/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/5.15.7
  /home/khw/blead/lib/perl5/site_perl
  .


Environment for perl 5.15.7:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

  PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
  PERL5OPT=-w
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

@p5pRT

p5pRT commented Feb 20, 2012

From @khwilliamson

On 02/20/2012 11:21 AM, karl williamson (via RT) wrote:

Here is a patch for this bug, which Yves just spotted. The bug
has existed in all of 5.14.

Should this go into 5.16?

-----------------------------------------------------------------
Setting flags in a regular expression using the (?foo:...) notation
loses any passed-in character set, for example one set through the -E flag.
So,
perl -E ' "\xe0" =~ /(?i:\w)/'
fails because the ?i: destroys the memory that the -E was used, which
should have forced Unicode semantics on the Latin1 character \xe0.

Spotted by Yves Orton
-----------------------------------------------------------------

@p5pRT

p5pRT commented Feb 20, 2012

From @khwilliamson

0001-perl-111174-foo-.-loses-passed-in-charset.patch
From b06487a9b23cdb9e91baed3868ac2fa8881235db Mon Sep 17 00:00:00 2001
From: Karl Williamson <[email protected]>
Date: Mon, 20 Feb 2012 11:27:03 -0700
Subject: [PATCH 1/2] [perl #111174] (?foo:...) loses passed in charset

This commit looks for the passed-in charset, and overrides it only if it
is /d and the pattern requires /u.  Previously the passed-in value was
ignored.
---
 regcomp.c  |    9 ++++++---
 t/re/pat.t |   11 ++++++++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index dd5a37c..a0597ca 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -8010,9 +8010,12 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
                 U32 posflags = 0, negflags = 0;
 	        U32 *flagsp = &posflags;
                 char has_charset_modifier = '\0';
-		regex_charset cs = (RExC_utf8 || RExC_uni_semantics)
-				    ? REGEX_UNICODE_CHARSET
-				    : REGEX_DEPENDS_CHARSET;
+		regex_charset cs = get_regex_charset(RExC_flags);
+		if (cs == REGEX_DEPENDS_CHARSET
+		    && (RExC_utf8 || RExC_uni_semantics))
+		{
+		    cs = REGEX_UNICODE_CHARSET;
+		}
 
 		while (*RExC_parse) {
 		    /* && strchr("iogcmsx", *RExC_parse) */
diff --git a/t/re/pat.t b/t/re/pat.t
index b4b7ac4..184f1f4 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -21,7 +21,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 469;  # Update this when adding/deleting tests.
+plan tests => 472;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1253,6 +1253,15 @@ EOP
         $anch_count++ while $str=~/^.*/mg;
         is $anch_count, 1, 'while "\n"=~/^.*/mg should match only once';
     }
+
+    { # [perl #111174]
+        use re '/u';
+        like "\xe0", qr/(?i:\xc0)/, "(?i: shouldn't lose the passed in /u";
+        use re '/a';
+        unlike "\x{100}", qr/(?i:\w)/, "(?i: shouldn't lose the passed in /a";
+        use re '/aa';
+        unlike 'k', qr/(?i:\N{KELVIN SIGN})/, "(?i: shouldn't lose the passed in /aa";
+    }
 } # End of sub run_tests
 
 1;
-- 
1.7.7.1
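
The charset selection that this patch changes can be looked at in isolation. The following is a minimal C sketch of the before and after logic, assuming a simplified enum in place of perl's regex_charset and plain booleans in place of RExC_utf8 and RExC_uni_semantics; the names charset, group_charset_before and group_charset_after are hypothetical and do not exist in regcomp.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for perl's regex_charset: /d, /l, /u, /a, /aa. */
typedef enum { CS_DEPENDS, CS_LOCALE, CS_UNICODE, CS_ASCII, CS_ASCII_MORE } charset;

/* Before the patch: the starting charset for a (?flags:...) group was
 * derived only from the two booleans, so whatever charset had been
 * passed in from outside was ignored. */
static charset group_charset_before(bool utf8, bool uni_semantics)
{
    return (utf8 || uni_semantics) ? CS_UNICODE : CS_DEPENDS;
}

/* After the patch: start from the passed-in charset and upgrade it only
 * when it is /d and the pattern requires /u. */
static charset group_charset_after(charset passed_in, bool utf8, bool uni_semantics)
{
    assert(passed_in <= CS_ASCII_MORE);
    if (passed_in == CS_DEPENDS && (utf8 || uni_semantics))
        return CS_UNICODE;
    return passed_in;
}
```

With these definitions, group_charset_before(false, false) yields CS_DEPENDS no matter which charset was in effect, whereas group_charset_after(CS_ASCII, false, false) keeps CS_ASCII and group_charset_after(CS_DEPENDS, true, false) still upgrades to CS_UNICODE.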

@p5pRT

p5pRT commented Feb 20, 2012

From @nwc10

On Mon, Feb 20, 2012 at 11:42:35AM -0700, Karl Williamson wrote:

On 02/20/2012 11:21 AM, karl williamson (via RT) wrote:

Here is a patch for this bug, which Yves just spotted. The bug
has existed in all of 5.14.

Should this go into 5.16?

-----------------------------------------------------------------
Setting flags in a regular expression using the (?foo:...) notation
loses any passed-in character set, for example one set through the -E flag.
So,
perl -E ' "\xe0" =~ /(?i:\w)/'
fails because the ?i: destroys the memory that the -E was used, which
should have forced Unicode semantics on the Latin1 character \xe0.

Well, valgrind thinks that it looks like this:

$ valgrind ./perl -Ilib -E ' "\xe0" =~ /(?i:\w)/'
==69701== Memcheck, a memory error detector
==69701== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==69701== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==69701== Command: ./perl -Ilib -E \ "\\xe0"\ =~\ /(?i:\\w)/
==69701==
--69701-- ./perl:
--69701-- dSYM directory is missing; consider using --dsymutil=yes
==69701== Invalid read of size 8
==69701== at 0x7FFFFFE00A71: ???
==69701== by 0x100201C0E: __inline_memmove_chk (in ./perl)
==69701== by 0x100215B31: Perl_sv_setpvn (in ./perl)
==69701== by 0x100260BED: Perl_newSVpv (in ./perl)
==69701== by 0x10004D793: S_init_postdump_symbols (in ./perl)
==69701== by 0x10004E825: S_parse_body (in ./perl)
==69701== by 0x10004BDD9: perl_parse (in ./perl)
==69701== by 0x100001403: main (in ./perl)
==69701== Address 0x100817e08 is 136 bytes inside a block of size 142 alloc'd
==69701== at 0x1004B95CF: malloc (vg_replace_malloc.c:266)
==69701== by 0x100160060: Perl_safesysmalloc (in ./perl)
==69701== by 0x10016B4BA: Perl_my_setenv (in ./perl)
==69701== by 0x10004B989: perl_parse (in ./perl)
==69701== by 0x100001403: main (in ./perl)
==69701==
==69701==
==69701== HEAP SUMMARY:

That has some potential for mischief, doesn't it?

Nicholas Clark

@p5pRT

p5pRT commented Feb 20, 2012

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Feb 20, 2012

From @khwilliamson

On 02/20/2012 11:46 AM, Nicholas Clark wrote:

On Mon, Feb 20, 2012 at 11:42:35AM -0700, Karl Williamson wrote:

On 02/20/2012 11:21 AM, karl williamson (via RT) wrote:

Here is a patch for this bug, which Yves just spotted. The bug
has existed in all of 5.14.

Should this go into 5.16?

-----------------------------------------------------------------
Setting flags in a regular expression using the (?foo:...) notation
loses any passed-in character set, for example one set through the -E flag.
So,
perl -E ' "\xe0" =~ /(?i:\w)/'
fails because the ?i: destroys the memory that the -E was used, which
should have forced Unicode semantics on the Latin1 character \xe0.

That was a poor choice of wording on my part. What I meant was that the
regex ignored the -E flag and was parsed as if it had a /d
modifier. Similarly for a "use re '/a'": the /a gets ignored.

That means the valgrind issue Nicholas raised is from some other cause, and it
doesn't show up on my system with either blead or blead+patch, in any of
the configurations I normally build with. For example:
config_args='-des -Dprefix=/home/khw/fastbleadperl -Dusedevel
-Dman1dir=none -D man3dir=none'
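
The same loss applies to /aa, which the tests in the patch also cover with \N{KELVIN SIGN}. As a rough illustration of what is at stake there, the C sketch below models case-insensitive comparison of two code points under /u and /aa. It is a toy under stated assumptions: the folding table is a tiny hand-picked subset, and the names fold_rules, toy_fold and matches_nocase are hypothetical, not perl's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: only the two rule sets needed for the example. */
typedef enum { FOLD_UNICODE, FOLD_ASCII_MORE } fold_rules;   /* /u, /aa */

/* Simple case fold for a small, hand-picked set of code points. */
static uint32_t toy_fold(uint32_t cp)
{
    assert(cp <= 0x10FFFF);                    /* must be a valid code point */
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;                      /* ASCII upper to lower */
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;                      /* Latin-1 upper to lower */
    if (cp == 0x212A)
        return 'k';                            /* KELVIN SIGN folds to k */
    return cp;
}

/* Case-insensitive comparison of one pattern code point with one target
 * code point.  Under /aa an ASCII character never matches a non-ASCII
 * one, even if their folds are equal. */
static bool matches_nocase(uint32_t pattern_cp, uint32_t target_cp, fold_rules rules)
{
    if (rules == FOLD_ASCII_MORE && (pattern_cp < 0x80) != (target_cp < 0x80))
        return false;
    return toy_fold(pattern_cp) == toy_fold(target_cp);
}
```

Here matches_nocase(0x212A, 'k', FOLD_UNICODE) is true, while the same call with FOLD_ASCII_MORE is false. If a (?i:...) group silently drops the passed-in /aa, the first behaviour is what the user gets instead of the second.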

@p5pRT

p5pRT commented Feb 21, 2012

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2012-02-20T13:42:35]

On 02/20/2012 11:21 AM, karl williamson (via RT) wrote:

Here is a patch for this bug, which Yves just spotted. The
bug has existed in all of 5.14.

Should this go into 5.16?

I lean toward being in favor of it. Other opinions? Especially objections?

--
rjbs

@p5pRT

p5pRT commented Feb 21, 2012

From @cpansprout

On Mon Feb 20 18:17:29 2012, perl.p5p@rjbs.manxome.org wrote:

* Karl Williamson <public@khwilliamson.com> [2012-02-20T13:42:35]

On 02/20/2012 11:21 AM, karl williamson (via RT) wrote:

Here is a patch for this bug, which Yves just spotted. The
bug has existed in all of 5.14.

Should this go into 5.16?

I lean toward being in favor of it. Other opinions? Especially
objections?

I think any kind of memory corruption or leak should be exempt from ‘no
user-visible changes’, even if it is visible.

--

Father Chrysostomos

@p5pRT

p5pRT commented Feb 23, 2012

From @khwilliamson

Since I couldn't reproduce the valgrind report on my 32-bit machine, I tried on
dromedary, and it doesn't happen there either. I used:
./Configure -des -Dusedevel -O -Uusenm -DDEBUGGING
I'm wondering what configuration you used. Could this be related to
dSYM being missing? (I don't know what that is, BTW.)

@p5pRT

p5pRT commented Feb 29, 2012

From @khwilliamson

Commit 96f5488
fixes the charset bug. I don't know where the valgrind report
is coming from, but I'm confident that it is unrelated.
--
Karl Williamson

@p5pRT

p5pRT commented Feb 29, 2012

@khwilliamson - Status changed from 'open' to 'resolved'
