problem with simple regular expression matching #7464

Closed
p5pRT opened this issue Aug 15, 2004 · 13 comments

Comments

p5pRT commented Aug 15, 2004

Migrated from rt.perl.org#31129 (status was 'resolved')

Searchable as RT31129$

p5pRT commented Aug 15, 2004

From @johanngeorge

A simple regular expression match is failing. The following short program
attempts twice to match a pattern that should always succeed. The first time
it succeeds; the second time it fails. If I uncomment the pos line, the
second match succeeds. Is there something I am unaware of? I am running a
vanilla Red Hat 9 release. I have also enclosed the information from perlbug.

  #!/usr/bin/env perl
  #
  use strict;
  use warnings;
  use diagnostics;

  my $s = "Hello";
  $s =~ /\GZ*/g or
      die "fail 1\n";

  # uncomment out next line and match succeeds.
  # pos($s) = pos($s);
  $s =~ /\GZ*/g or
      die "fail 2\n";


Flags:
  category=
  severity=


Site configuration information for perl v5.8.0:

Configured by bhcompile at Tue Feb 18 22:17:47 EST 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
  osname=linux, osvers=2.4.20-2.48smp, archname=i386-linux-thread-multi
  uname='linux stripples.devel.redhat.com 2.4.20-2.48smp #1 smp thu feb 13 11:44:55 est 2003 i686 i686 i386 gnulinux '
  config_args='-des -Doptimize=-O2 -march=i386 -mcpu=i686 -g -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
  hint=recommended, useposix=true, d_sigaction=define
  usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=undef use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
  optimize='-O2 -march=i386 -mcpu=i686 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
  ccversion='', gccversion='3.2.2 20030213 (Red Hat Linux 8.0 3.2.2-1)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='gcc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
  perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
  libc=/lib/libc-2.3.1.so, so=so, useshrplib=true, libperl=libperl.so
  gnulibc_version='2.3.1'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
  MAINT18379


@INC for perl v5.8.0:
  perlLib
  /home/johann/perlLibL
  /home/johann/perlLibG/i386-linux-thread-multi
  /home/johann/perlLibG
  /usr/lib/perl5/5.8.0/i386-linux-thread-multi
  /usr/lib/perl5/5.8.0
  /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
  /usr/lib/perl5/site_perl/5.8.0
  /usr/lib/perl5/site_perl
  /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
  /usr/lib/perl5/vendor_perl/5.8.0
  /usr/lib/perl5/vendor_perl
  /usr/lib/perl5/5.8.0/i386-linux-thread-multi
  /usr/lib/perl5/5.8.0
  .


Environment for perl v5.8.0:
  HOME=/home/johann
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LC_ALL=C
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/johann/eOutside:/home/johann/eBinaries:/home/johann/eScripts:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:./bin
  PERL5LIB=perlLib:/home/johann/perlLibL:/home/johann/perlLibG
  PERL_BADLANG (unset)
  SHELL=/bin/bash

p5pRT commented Aug 15, 2004

From @JohnPeacock

Johann George (via RT) wrote:

> # uncomment out next line and match succeeds.
> # pos($s) = pos($s);
> $s =~ /\GZ*/g or
>     die "fail 2\n";

Setting pos() while using //g is definitely double uncool (even though you are
setting it to the same value it had):

   pos SCALAR
   pos     Returns the offset of where the last "m//g" search left off for
           the variable in question ($_ is used when the variable is not
           specified).  May be modified to change that offset.  Such
           modification will also influence the "\G" zero-width assertion in
           regular expressions.  See perlre and perlop.

Running the code without the assignment with the added line:

  use re 'debug';

displays this:

Matching REx `\GZ*' against `Hello'
Setting an EVAL scope, savestack=5
0 <> <Hello> | 1: GPOS
0 <> <Hello> | 2: STAR
EXACT <Z> can match 0 times out of 2147483647...
Setting an EVAL scope, savestack=5
0 <> <Hello> | 5: END
Match successful!
Matching REx `\GZ*' against `Hello'
Setting an EVAL scope, savestack=5
0 <> <Hello> | 1: GPOS
0 <> <Hello> | 2: STAR
EXACT <Z> can match 0 times out of 2147483647...
Setting an EVAL scope, savestack=5
0 <> <Hello> | 5: END
Match possible, but length=0 is smaller than requested=1, failing!
failed...
Match failed

but I'm not at all sure where the "requested=1" is coming from. FWIW, this
works fine (as far as both regexes succeed) with 5.005_03, but fails with 5.6.1
and later.

HTH

John

--
John Peacock
Director of Information Research and Technology
Rowman & Littlefield Publishing Group
4720 Boston Way
Lanham, MD 20706
301-459-3366 x.5010
fax 301-429-5747

p5pRT commented Aug 15, 2004

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Aug 15, 2004

From Nick Ing-Simmons

John Peacock <jpeacock@rowman.com> writes:

> 0 <> <Hello> | 2: STAR
> EXACT <Z> can match 0 times out of 2147483647...
> Setting an EVAL scope, savestack=5
> 0 <> <Hello> | 5: END
> Match possible, but length=0 is smaller than requested=1, failing!
> failed...
> Match failed
>
> but I'm not at all sure where the "requested=1" is coming from.

Presumably this is a fix for zero-width matching in the same place multiple
times, so that

  while (/\GZ*/g) { ... }

doesn't loop forever.
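
To see that guard in action, here is a minimal sketch (the strings and counter names are made up for illustration, not taken from the report). It counts how often each loop body runs. Without the zero-length rule, both loops would sit at offset 0 indefinitely.

```perl
use strict;
use warnings;

# Anchored case: \G pins every attempt to pos(), so once the empty match
# at offset 0 has been reported there is nothing else the pattern can match.
my $anchored = "Hello";
my $anchored_iterations = 0;
while ($anchored =~ /\GZ*/g) {
    $anchored_iterations++;
}

# Unanchored case: after an empty match the engine may move forward, so it
# reports one empty match at each of the offsets 0, 1, 2 and 3.
my $unanchored = "abc";
my $unanchored_iterations = 0;
while ($unanchored =~ /x*/g) {
    $unanchored_iterations++;
}

print "anchored: $anchored_iterations, unanchored: $unanchored_iterations\n";
# prints "anchored: 1, unanchored: 4"
```

The anchored loop ending after one pass is the same effect as "fail 2" in the original report: the second attempt is not allowed to match the empty string at offset 0 again.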

> FWIW, this
> works fine (as far as both regexes succeed) with 5.005_03, but fails with 5.6.1
> and later.
>
> HTH
>
> John

p5pRT commented Aug 15, 2004

From @ysth

On Sun, Aug 15, 2004 at 01:21:23PM +0100, Nick Ing-Simmons <nick@ing-simmons.net> wrote:

> John Peacock <jpeacock@rowman.com> writes:
>
> > 0 <> <Hello> | 2: STAR
> > EXACT <Z> can match 0 times out of 2147483647...
> > Setting an EVAL scope, savestack=5
> > 0 <> <Hello> | 5: END
> > Match possible, but length=0 is smaller than requested=1, failing!
> > failed...
> > Match failed
> >
> > but I'm not at all sure where the "requested=1" is coming from.
>
> Presumably this is a fix for zero width matching in same place multiple times.
>
> So
>
>   while (/\GZ*/g) { ... }
>
> doesn't loop forever.

And is documented in perlre "Repeated patterns matching zero-length substring".
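
As a rough sketch of what that perlre section means for the reported program (variable names here are illustrative only), the snippet below records the result and pos() after each of the two scalar-context //g matches. It uses the `//` operator, so it assumes perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "Hello";

# First attempt: Z* matches the empty string at offset 0.
my $first_ok  = ($s =~ /\GZ*/g) ? 1 : 0;
my $first_pos = pos($s);

# Second attempt: the only match available at offset 0 is empty again,
# which the rule forbids, and \G stops the engine from looking elsewhere.
my $second_ok  = ($s =~ /\GZ*/g) ? 1 : 0;
my $second_pos = pos($s);   # a failed //g match (without /c) resets pos()

printf "first:  ok=%d pos=%s\n", $first_ok,  $first_pos  // 'undef';
printf "second: ok=%d pos=%s\n", $second_ok, $second_pos // 'undef';
# prints:
# first:  ok=1 pos=0
# second: ok=0 pos=undef
```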

p5pRT commented Aug 16, 2004

From @tamias

On Sun, Aug 15, 2004 at 08:16:21AM -0400, John Peacock wrote:

> Johann George (via RT) wrote:
>
> > # uncomment out next line and match succeeds.
> > # pos($s) = pos($s);
> > $s =~ /\GZ*/g or
> >     die "fail 2\n";
>
> Setting pos() while using //g is definitely double uncool (even though you
> are setting it to the same value it had):

What?! That's the whole point of pos() being an lvalue; you can set it
while using //g. There's nothing wrong with doing this.

See the perlre documentation, "Repeated patterns matching zero-length
substring", for the likely explanation for this behavior.
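
For illustration, a small sketch of the original program with the pos() line enabled (the `@results` array is only there to collect the outcomes). The pos() documentation notes that assigning to pos() also resets the "matched with zero-length" flag, which is why the second match is allowed to be empty again.

```perl
use strict;
use warnings;

my $s = "Hello";
my @results;

# Zero-length match at offset 0.
push @results, ($s =~ /\GZ*/g ? "ok" : "fail");

# Same offset as before, but the assignment clears the zero-length flag.
pos($s) = pos($s);

# The pattern may therefore match the empty string at offset 0 again.
push @results, ($s =~ /\GZ*/g ? "ok" : "fail");

print "@results\n";   # prints "ok ok"
```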

Ronald

p5pRT commented Aug 16, 2004

From @JohnPeacock

Ronald J Kimball wrote:

> On Sun, Aug 15, 2004 at 08:16:21AM -0400, John Peacock wrote:
>
> > Setting pos() while using //g is definitely double uncool (even though you
> > are setting it to the same value it had):
>
> What?! That's the whole point of pos() being an lvalue; you can set it
> while using //g. There's nothing wrong with doing this.

Sorry, I should have been a little less broad with my brush. Setting pos()
while expecting \G to do something useful in the context of //g is what I was
suggesting was bad form.

> See the perlre documentation, "Repeated patterns matching zero-length
> substring", for the likely explanation for this behavior.

Yes, that explains it quite fine, as others have already pointed out. I was at
pains to figure out why the OP thought that regex should always match until I
read that POD. I guess my expectation meets with the current code behavior...

John


p5pRT commented Aug 16, 2004

From Jeff "japhy" Pinyan

On Aug 16, John Peacock said:

> Sorry, I should have been a little less broad with my brush. Setting
> pos() while expecting \G to do something useful in the context of //g is
> what I was suggesting was bad form.

If the docs don't state that \G is just an anchor that only matches at the
location in the string referred to by pos()'s value, the docs need
adjusting.

It might make you sick to know that

  $_ = "japhy";
  pos($_) = 2;
  print /(.\G.)/;

works and prints "ap".
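
A tamer sketch of the same point, that \G only matches at the offset stored in pos(). The "key=value" string and the variable names are invented for this example, and `//` assumes perl 5.10 or later.

```perl
use strict;
use warnings;

my $text = "key=value";

# Offset 4 is the "v" of "value", so the anchored match starts there.
pos($text) = 4;
my $anchored_hit = $text =~ /\G(\w+)/g ? $1 : undef;
my $end_offset   = pos($text);

# Offset 3 is the "=". \G refuses to skip ahead, so this fails.
pos($text) = 3;
my $anchored_miss = $text =~ /\G(\w+)/g ? $1 : undef;

# Without \G the engine scans forward from offset 3 and finds "value".
pos($text) = 3;
my $unanchored_hit = $text =~ /(\w+)/g ? $1 : undef;

print "anchored at 4:   ", $anchored_hit   // 'undef', " (pos now $end_offset)\n";
print "anchored at 3:   ", $anchored_miss  // 'undef', "\n";
print "unanchored at 3: ", $unanchored_hit // 'undef', "\n";
# prints:
# anchored at 4:   value (pos now 9)
# anchored at 3:   undef
# unanchored at 3: value
```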

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
http://japhy.perlmonk.org/ % have long ago been overpaid?
http://www.perlmonks.org/ % -- Meister Eckhart

p5pRT commented Aug 16, 2004

From @JohnPeacock

Jeff 'japhy' Pinyan wrote:

> If the docs don't state that \G is just an anchor that only matches at the
> location in the string referred to by pos()'s value, the docs need
> adjusting.

Patches welcome! ;)

> It might make you sick to know that
>
>   $_ = "japhy";
>   pos($_) = 2;
>   print /(.\G.)/;
>
> works and prints "ap".

Please keep your perversions to yourself in this public forum. :0

John

--
John Peacock
Director of Information Research and Technology
Rowman & Littlefield Publishing Group
4501 Forbes Boulevard
Suite H
Lanham, MD 20706
301-459-3366 x.5010
fax 301-429-5748

p5pRT commented Aug 16, 2004

@iabyn - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Aug 16, 2004

p5pRT commented Aug 16, 2004

From @johanngeorge

I am most impressed with your responsiveness. Thank you all for your
insightful comments. I just read the section that Scott referred to
("Repeated patterns matching zero-length substring") and found the
following comment, which might be relevant:

  The higher-level loops preserve an additional state between
  iterations: whether the last match was zero-length. To break the
  loop, the following match after a zero-length match is prohibited to
  have a length of zero.

I have been using

  $str =~ /\G ... /g;
 
to parse a grammar. Each routine matches what it needs and the pos()
pointer is automatically moved along. Given this behavior, the fix I
have used is to set

  pos($str) = pos($str)

after every attempted match that might be zero-length. I gather that it
resets the "additional state" and seems to work. Is this a good idiom?
Is there a more official way to reset the "additional state"?

Thanks much.

Johann

p5pRT commented Aug 16, 2004

From @JohnPeacock

Johann George wrote:

> after every attempted match that might be zero-length. I gather that it
> resets the "additional state" and seems to work. Is this a good idiom?
> Is there a more official way to reset the "additional state"?

As long as you don't do this in a while() loop, you should be fine.
Without the context of what you were doing (parsing a grammar), there was no
way to give you better advice than "then don't do that!"

John


p5pRT commented Aug 17, 2004

From [email protected]

Johann George <johann@georgex.org> writes:

> I am most impressed with your responsiveness. Thank you all for your
> insightful comments. I just read the section that Scott referred to
> ("Repeated patterns matching zero-length substring") and found the
> following comment, which might be relevant:
>
>   The higher-level loops preserve an additional state between
>   iterations: whether the last match was zero-length. To break the
>   loop, the following match after a zero-length match is prohibited to
>   have a length of zero.
>
> I have been using
>
>   $str =~ /\G ... /g;
>
> to parse a grammar. Each routine matches what it needs and the pos()
> pointer is automatically moved along. Given this behavior, the fix I
> have used is to set
>
>   pos($str) = pos($str)
>
> after every attempted match that might be zero-length. I gather that it
> resets the "additional state" and seems to work. Is this a good idiom?

In the context of Perl regexps, yes.
But parsing a grammar that has leading zero-length items
normally leads to a grammar rewrite unless "look ahead"
can resolve which one it was.

> Is there a more official way to reset the "additional state"?

Don't think so.
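
For readers who land here with the same parsing problem, the idiom discussed in this thread might be wrapped up roughly as follows. This is only a sketch under stated assumptions: `take` is a hypothetical helper, the rules and the "42+x" input are invented, and the /c modifier (which keeps pos() in place when a match fails) is an addition that nobody in the thread proposed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): try one \G-anchored rule at the
# current pos() of the string. Returns the matched text, or nothing on failure.
sub take {
    my ($str_ref, $rule) = @_;
    # /c keeps pos() where it was when the rule fails, so the caller can
    # try a different rule at the same offset.
    return unless $$str_ref =~ /\G($rule)/gc;
    my $text = $1;
    # Re-assigning pos() keeps the offset but clears the "last match was
    # zero-length" state, so the next rule may also match the empty string.
    pos($$str_ref) = pos($$str_ref);
    return $text;
}

my $input  = "42+x";
my $space  = take(\$input, qr/\s*/);     # empty match at offset 0
my $sign   = take(\$input, qr/[+-]?/);   # another empty match at offset 0
my $number = take(\$input, qr/\d+/);     # "42"
my $op     = take(\$input, qr/[+-]/);    # "+"
my $name   = take(\$input, qr/[a-z]+/);  # "x"

print join("|", map { defined $_ ? "<$_>" : "<undef>" }
           $space, $sign, $number, $op, $name), "\n";
# prints "<>|<>|<42>|<+>|<x>"
```

Each call is made in scalar context here, so a failed rule would come back as undef rather than as an empty list.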
