Regex optimizer fails with mixture of EXACT, and EXACTFU #12295

Closed
p5pRT opened this issue Jul 27, 2012 · 7 comments
Comments

@p5pRT

p5pRT commented Jul 27, 2012

Migrated from rt.perl.org#114282 (status was 'resolved')

Searchable as RT114282$

@p5pRT
Author

p5pRT commented Jul 27, 2012

From @khwilliamson

This is a bug report for perl from khw@​karl.(none),
generated with the help of perlbug 1.39 running under perl 5.17.3.


This fails:
perl -Mre=Debug,COMPILE,EXECUTE -le 'my $c = "\x{0130}_"; my $p = qr/(?u:((?i:\x{0049}\x{0307}),?)_)/; print $c =~ $p;'

It should succeed because the fold of \x{130} is \x{69}\x{307}, which
\x49\x{307} matches under /i, and the comma is optional. It compiles
correctly, but we get the following output:

floating utf8 "_" at 2..3 (checking floating) stclass EXACTFU <i\x{307}> minlen 3
r->extflags: USE_INTUIT_NOML USE_INTUIT_ML UNICODE
Guessing start of match in sv for REx "(?u:((?i:\x{0049}\x{0307}),?)_)" against "%x{130}_"
UTF-8 pattern and string...
Did not find floating substr "_"...
Match rejected by optimizer

It succeeds if the underscore is changed into "(?i:_)". This means the
combination of EXACT and EXACTFish nodes causes the problem.
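
For anyone who wants to re-check this on their own perl, here is a minimal sketch of the report as a standalone script. It assumes perl 5.16 or later (for fc()); the variable names and the "workaround" label are mine, not from the ticket. It prints the fold of U+0130 and then tries both the reported pattern and the "(?i:_)" variant.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() is available from perl 5.16

my $c = "\x{0130}_";

# The full case fold of U+0130 is two characters: "i" followed by U+0307.
my $fold = fc("\x{0130}");
printf "fold of U+0130: %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $fold;

# Pattern from the report, and the variant that avoids mixing EXACT
# with EXACTFU by making the underscore case-insensitive too.
my $reported   = qr/(?u:((?i:\x{0049}\x{0307}),?)_)/;
my $workaround = qr/(?u:((?i:\x{0049}\x{0307}),?)(?i:_))/;

for my $case ([reported => $reported], [workaround => $workaround]) {
    my ($name, $re) = @$case;
    printf "%-10s %s\n", $name, ($c =~ $re ? 'matches' : 'does not match');
}
```

On an affected perl the "reported" line is the one expected to say "does not match"; adding -Mre=Debug,COMPILE,EXECUTE on the command line shows the optimizer output quoted above.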



Flags​:
  category=core
  severity=medium


Site configuration information for perl 5.17.3​:

Configured by khw at Thu Jul 26 08​:28​:09 MDT 2012.

Summary of my perl5 (revision 5 version 17 subversion 3) configuration​:
  Commit id​: 43cd5cb
  Platform​:
  osname=linux, osvers=2.6.35-32-generic-pae,
archname=i686-linux-thread-multi-64int-ld
  uname='linux karl 2.6.35-32-generic-pae #67-ubuntu smp mon mar 5
21​:23​:19 utc 2012 i686 gnulinux '
  config_args='-des -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-ggdb3' -A'optimize=-ggdb3' -A'optimize=-O0' -Dman1dir=none
-Dman3dir=none -DDEBUGGING -Dcc=g++ -Dusemorebits -Dusethreads'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=define
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='g++', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O0 -ggdb3',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.4.5', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='long double', nvsize=12,
Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='g++', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib
/usr/lib/i686-linux-gnu
  libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=/lib/../lib/libc.so.6, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.12'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -ggdb3 -ggdb3 -O0
-L/usr/local/lib -fstack-protector'

Locally applied patches​:


@​INC for perl 5.17.3​:

/home/khw/blead/lib/perl5/site_perl/5.17.3/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/site_perl/5.17.3
  /home/khw/blead/lib/perl5/5.17.3/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/5.17.3
  /home/khw/blead/lib/perl5/site_perl
  .


Environment for perl 5.17.3​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE=en_US​:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/home/khw/bin​:/home/khw/print/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/home/khw/cxoffice/bin
  PERL5OPT=-w
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

@p5pRT
Author

p5pRT commented Jul 28, 2012

From @khwilliamson

I have done some more research into this, as follows. A simpler example
that fails is this one:

blead -Mre=Debug,All -E 'say "\N{LATIN SMALL LIGATURE FF}_" =~ /(?i:ff)_/'

That yields:
Compiling REx "(?i:ff)_"
Final program:
  1: EXACTFU <ff> (3)
  3: EXACT <_> (5)
  5: END (0)
anchored "_" at 2 (checking anchored) stclass EXACTFU <ff> minlen 3
Matching REx "(?i:ff)_" against "%x{fb00}_"
UTF-8 string...
Did not find anchored substr "_"...
Match failed

The regex optimizer, split between regcomp.c and regexec.c, does not
take into account the possibility of multi-character folds in these
situations. I was wrong to say that the original compiled correctly.
The problem is that the pattern can match either 2 or 3 characters, and
the regcomp.c half of the optimizer doesn't tell this to the regexec.c
half. This part:

anchored "_" at 2 (checking anchored) stclass EXACTFU <ff> minlen 3

is wrong because the "_" can start at offset 1 or 2, depending on
whether the <ff> node matches one character or two, so the "2" should be
a range. In the example in the original report, the "2..3" should have
been "1..3". If it is changed by hand in the debugger to 1..3, the
match properly succeeds.
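
The offset arithmetic above can be sketched in Perl. This is only an illustration of the reasoning, not how regcomp.c computes it: %MULTI_FOLD is a hand-picked sample of folded sequences, and min_chars() and offset_range() are hypothetical helper names. The idea is that a folded literal of N characters may be matched by fewer than N target characters whenever part of it is the fold of a single character.

```perl
use strict;
use warnings;

# Illustrative subset of multi-character folds (folded sequence => true).
# The full list comes from Unicode's CaseFolding.txt; this is a sample.
my %MULTI_FOLD = map { $_ => 1 } (
    "ff", "fi", "fl", "ffi", "ffl", "ss", "st", "i\x{307}",
);

# Fewest target characters that a folded, case-insensitive literal can
# match, when any listed sequence may be matched by a single character.
sub min_chars {
    my ($folded) = @_;
    my $n    = length $folded;
    my @best = (0) x ($n + 1);    # $best[$i]: minimum for the first $i chars
    for my $i (1 .. $n) {
        $best[$i] = $best[$i - 1] + 1;    # consume one character as-is
        for my $len (2 .. $i) {
            my $seq = substr($folded, $i - $len, $len);
            next unless $MULTI_FOLD{$seq};
            my $cand = $best[$i - $len] + 1;    # whole sequence as one char
            $best[$i] = $cand if $cand < $best[$i];
        }
    }
    return $best[$n];
}

# Range of offsets at which the literal after the folded part can start.
# $optional_chars counts optional single characters in between (the ",?").
sub offset_range {
    my ($folded, $optional_chars) = @_;
    return (min_chars($folded), length($folded) + $optional_chars);
}

printf "(?i:ff)_ : %d..%d\n", offset_range("ff", 0);
printf "original : %d..%d\n", offset_range("i\x{307}", 1);
```

For the two patterns in this ticket the sketch gives 1..2 and 1..3, which agrees with the ranges described above.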

The obvious solution would be to change the regcomp.c portion of the
optimizer to detect multi-character folds in EXACTFish nodes, and to
widen the range passed to regexec.c accordingly. However, I don't see
a really efficient way to do this.

The best method I've come up with so far along those lines is to go
through every EXACTFish node looking for multi-char folded sequences in
regcomp.c's join_exact(). A hash would be constructed at initialization
whose keys are the UTF-8 of all the 49 characters (currently) which
begin multi-character folds. Every character in the node would be
looked up in that hash. The hash values would form a chain to find all
the multi-char fold sequences that begin with the key.
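
A rough Perl model of that hash lookup follows. It is a sketch under assumptions: the real code would be C inside join_exact() working on UTF-8 bytes, @SEQUENCES is a small hand-picked sample rather than the full list, and find_multi_folds() is a hypothetical name. Here the keys are the first character of each folded sequence and each value is the chain of sequences starting with that character.

```perl
use strict;
use warnings;

# Hand-picked sample of folded multi-character sequences; not the full list.
my @SEQUENCES = ("ff", "fi", "fl", "ffi", "ffl", "ss", "st", "i\x{307}");

# Key: first character of a sequence. Value: every sequence starting with it.
my %starts;
push @{ $starts{ substr($_, 0, 1) } }, $_ for @SEQUENCES;

# Return [offset, sequence] pairs for each multi-char fold found in $node,
# where $node is the already-folded text of an EXACTFish node.
sub find_multi_folds {
    my ($node) = @_;
    my @hits;
    for my $pos (0 .. length($node) - 1) {
        # One hash lookup per character; most characters have no chain.
        my $chain = $starts{ substr($node, $pos, 1) } or next;
        for my $seq (@$chain) {
            push @hits, [$pos, $seq]
                if substr($node, $pos, length $seq) eq $seq;
        }
    }
    return @hits;
}

printf "%d: %s\n", @$_ for find_multi_folds("office");
```

The cost is one lookup per character of every EXACTFish node, plus a short chain walk for the few characters that have one, which is the overhead the paragraph above is worried about.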

--
Karl Williamson

@p5pRT
Author

p5pRT commented Jul 28, 2012

@khwilliamson - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Nov 14, 2012

From @khwilliamson

This was fixed mainly by commit 0a982f0
and was completely fixed by the time
3465e1f was done
--
Karl Williamson

@p5pRT
Author

p5pRT commented Nov 14, 2012

@khwilliamson - Status changed from 'open' to 'resolved'
