[regex] backref problem with quantified groups #8267

Open
p5pRT opened this issue Jan 2, 2006 · 19 comments · May be fixed by #20677

@p5pRT

p5pRT commented Jan 2, 2006

Migrated from rt.perl.org#38133 (status was 'open')

Searchable as RT38133$

@p5pRT
Author

p5pRT commented Nov 27, 2002

From edi@agharta.de

Created by edi@agharta.de

  edi@bird:~ > perl -e 'use Data::Dumper; "a" =~ /((a)*)*/; print Dumper $1, $2'
  $VAR1 = '';
  $VAR2 = undef;
  edi@bird:~ > perl -e 'use Data::Dumper; "a" =~ /(((a))*)*/; print Dumper $1, $2'
  $VAR1 = '';
  $VAR2 = 'a';

Obviously, $2 should be either undef or 'a' in _both_ cases. I think
we are seeing this because of a faulty optimization, and I have posted a
more detailed analysis to comp.lang.perl.misc:

  <http://groups.google.com/groups?selm=87zns15gal.fsf%40bird.agharta.de&rnum=7>
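
For readers who want to compare the two patterns side by side, here is a minimal sketch. The helper name `captures` is hypothetical (it is not part of the original report), and no expected output is shown because the result depends on the perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper: match $str against $re and return ($1, $2, ...).
# In list context a successful match without /g returns the capture
# groups, with undef for groups that did not participate.
sub captures {
    my ($str, $re) = @_;
    my @caps = $str =~ $re;
    return @caps;
}

# The two patterns from the report; they differ only by one extra
# pair of capturing parentheses around (a).
for my $re (qr/((a)*)*/, qr/(((a))*)*/) {
    my @caps = captures('a', $re);
    printf "%-20s \$2 is %s\n", $re,
        defined $caps[1] ? "'$caps[1]'" : 'undef';
}
```

If the two lines report different values for $2, the perl being tested still shows the inconsistency described above.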

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.0:

Configured by edi at Wed Nov 13 01:41:22 CET 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.19-gentoo-r5, archname=i686-linux-thread-multi
    uname='linux bird.agharta.de 2.4.19-gentoo-r5 #5 wed aug 7 13:06:53 cest 2002 i686 genuineintel '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.3 20010315 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.0:
    /usr/lib/site_perl/5.6.1
    /opt/perl-5.8/lib/5.8.0/i686-linux-thread-multi
    /opt/perl-5.8/lib/5.8.0
    /opt/perl-5.8/lib/site_perl/5.8.0/i686-linux-thread-multi
    /opt/perl-5.8/lib/site_perl/5.8.0
    /opt/perl-5.8/lib/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/edi
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/lib:
    LOGDIR (unset)
    PATH=/usr/local/bin:/home/edi/.bin:/opt/opera/bin:/opt/scl/bin:/usr/kde/3/bin:/bin:/usr/bin:/usr/local/bin:/opt/Acrobat5:/opt/opera/bin:/opt/RealPlayer8:/usr/X11R6/bin:/opt/sun-jdk-1.4.0/bin:/opt/sun-jdk-1.4.0/jre/bin:/usr/qt/3/bin:/usr/kde/3/bin
    PERL5LIB=/usr/lib/site_perl/5.6.1
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Dec 1, 2002

From @andk

On 27 Nov 2002 09:18:54 -0000, "edi@agharta.de (via RT)" <perlbug@perl.org> said:

  > # New Ticket Created by edi@agharta.de
  > # Please include the string: [perl #18708]
  > # in the subject line of all future correspondence about this issue.
  > # <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=18708 >

  > This is a bug report for perl from edi@agharta.de,
  > generated with the help of perlbug 1.34 running under perl v5.8.0.

  > -----------------------------------------------------------------
  > [Please enter your report here]

  > edi@bird:~ > perl -e 'use Data::Dumper; "a" =~ /((a)*)*/; print Dumper $1, $2'
  > $VAR1 = '';
  > $VAR2 = undef;

Archaeological findings about this bug...

It was introduced to the trunk with patch 6373.

The bug was also integrated into 5.6.1 with patch 7772. (Note: 7772
only compiles if 7799 is also integrated.)

Simply undoing the regexec.c part of that patch fixes the bug but also
breaks test 860 in the test suite:

not ok 860 () ^(a(b)?)+$:aba:y:-$1-$2-:-a-- => `-a-b-', match=1

The patch I tried was:

#### DO NOT APPLY ####

Inline Patch
--- perl-5.8.0@18217/regexec.c	Fri Nov 29 21:38:04 2002
+++ perl-5.8.0@18217-ak/regexec.c	Sun Dec  1 18:31:08 2002
@@ -293,8 +293,6 @@
 	    PL_regstartp[paren] = HOPc(input, -1) - PL_bostr;	\
 	    PL_regendp[paren] = input - PL_bostr;		\
 	}							\
-	else							\
-	    PL_regendp[paren] = -1;				\
     }								\
     if (regmatch(next))						\
 	sayYES;							\


Hope that helps somebody else to find a solution,
-- andreas

@p5pRT
Author

p5pRT commented Jan 3, 2003

From @hvds

andreas.koenig@anima.de (Andreas J. Koenig) wrote:
:>>>>> On 27 Nov 2002 09:18:54 -0000, "edi@agharta.de (via RT)" <perlbug@perl.org> said:
: > edi@bird:~ > perl -e 'use Data::Dumper; "a" =~ /((a)*)*/; print Dumper $1, $2'
: > $VAR1 = '';
: > $VAR2 = undef;
:
:Archaeological findings about this bug...
:
:It was introduced to the trunk with patch 6373.
:
:The bug was also integrated into 5.6.1 with patch 7772. (Note: 7772
:only compiles if 7799 is also integrated.)

Digging a bit further, the actual patch was submitted (by me) in the
discussion on bug #20000701.002:
  http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-07/msg00514.html

The difference between the two test cases is that for /((a)*)*/, the
inner paren gets optimised to CURLYN; for /(((a))*)*/ it stays as
CURLYX. I suspect that there is something lacking from that patch for
the CURLYN branch, but I haven't yet got a fix.
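
One way to check which quantifier opcodes a given perl picks for each pattern, without a DEBUGGING build, is to compile the pattern under `use re 'debug'` and scan the dump. The sketch below does this in a child process; the helper name `curly_ops` is hypothetical, and it assumes a POSIX-style shell, a perl path without spaces, and patterns that contain no single quotes or slashes. The opcode names reported will vary between perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: compile $pattern in a child perl with regex
# debugging enabled and return the distinct CURLY* opcode names that
# appear in the debug dump (which is written to STDERR).
sub curly_ops {
    my ($pattern) = @_;
    my $dump = qx{$^X -Mre=debug -e 'qr/$pattern/' 2>&1};
    my %seen;
    $seen{$1}++ while $dump =~ /\b(CURLY[A-Z]*)\b/g;
    return sort keys %seen;
}

for my $pattern ('((a)*)*', '(((a))*)*') {
    print "/$pattern/ uses: ", join(', ', curly_ops($pattern)), "\n";
}
```

A difference between the two lines would be consistent with the explanation above that the inner group is compiled differently in the two cases.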

Hugo

@p5pRT
Author

p5pRT commented Jan 2, 2006

From eric.niebler@gmail.com

Created by eric.niebler@gmail.com

This is a bug report for perl from eric.niebler@​gmail.com,
generated with the help of perlbug 1.35 running under perl v5.8.7.

-----------------------------------------------------------------
Consider the following program:

  $str = 'aaA';
  $str =~ /(((?:a))?)+/i;
  if(defined($2)) { print "$2"; }
  else { print "not defined"; }

This prints "not defined," and I think that's right.
But if I change the regex to /(((a))?)+/i (that is, if
I change the third group from non-capturing to capturing),
the program prints "A".

I can't think of a reason why changing group 3 from
non-capturing to capturing should have any effect on
whether group 2 captures anything. Seems like a regex
bug to me.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.7:

Configured by builder at Wed Nov  2 08:44:18 2005.

Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
    osname=MSWin32, osvers=5.0, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    usethreads=define use5005threads=undef useithreads=define 
usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -Gf -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DNO_HASH_SEED 
-DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS 
-DUSE_PERLIO -DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.00.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', 
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf  
-libpath:"C:\Perl\lib\CORE"  -machine:x86'
    libpth=\lib
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib 
odbc32.lib odbccp32.lib msvcrt.lib
    perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib 
winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib 
oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  
version.lib odbc32.lib odbccp32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl58.lib
    gnulibc_version='undef'
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug 
-opt:ref,icf  -libpath:"C:\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    Iin_load_module moved for compatibility with build 806
    Avoid signal flag SA_RESTART for older versions of HP-UX
    PerlEx support in CGI::Carp
    Less verbose ExtUtils::Install and Pod::Find
    instmodsh upgraded from ExtUtils-MakeMaker-6.25
    Patch for CAN-2005-0448 from Debian with modifications
    Upgrade to Time-HiRes-1.76
    25774 Keys of %INC always use forward slashes
    25747 Accidental interpolation of $@ in Pod::Html
    25362 File::Path::mkpath resets errno
    25181 Incorrect (X)HTML generated by Pod::Html
    24999 Avoid redefinition warning for MinGW
    24699 ICMP_UNREACHABLE handling in Net::Ping
    21540 Fix backward-compatibility issues in if.pm


@INC for perl v5.8.7:
    C:/Perl/lib
    C:/Perl/site/lib
    .


Environment for perl v5.8.7:
    HOME=C:\DOCUME~1\\ericne
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files\Microsoft Visual Studio .NET 
2003\Common7\IDE;C:\Program Files\Microsoft Visual Studio .NET 
2003\VC7\BIN;C:\Program Files\Microsoft Visual Studio .NET 
2003\Common7\Tools;C:\Program Files\Microsoft Visual Studio .NET 
2003\Common7\Tools\bin\prerelease;C:\Program Files\Microsoft Visual 
Studio .NET 2003\Common7\Tools\bin;C:\Program Files\Microsoft Visual 
Studio .NET 
2003\SDK\v1.1\bin;C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322;C:\Program 
Files\libxml;C:\Perl\bin\;C:\Program Files\Windows Resource 
Kits\Tools\;C:\Python23\.;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\Program 
Files\Perforce;C:\Program Files\doxygen\bin;C:\Program Files\Debugging 
Tools for 
Windows;C:\WINDOWS\idw;C:\cygwin\home\ericne\boost\cvs\boost\tools\build\jam_src\bin.ntx86
    PERL_BADLANG (unset)
    SHELL (unset)


@p5pRT
Author

p5pRT commented Jan 3, 2006

From @hvds

Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
:Consider the following program:
:
: $str = 'aaA';
: $str =~ /(((?:a))?)+/i;
: if(defined($2)) { print "$2"; }
: else { print "not defined"; }
:
:This prints "not defined," and I think that's right.
:But if I change the regex to /(((a))?)+/i (that is, if
:I change the third group from non-capturing to capturing),
:the program prints "A".
:
:I can't think of a reason why changing group 3 from
:non-capturing to capturing should have any effect on
:whether group 2 captures anything. Seems like a regex
:bug to me.

I agree the inconsistency smells like a bug, though it isn't clear
to me which variant exhibits it - both results seem reasonable in
the absence of the other.

-Dr output shows that the two regexps are optimised differently:
with /(((?:a))?)+/, the $2 loop is optimised to CURLYN, but with
/(((a))?)+/ the interior is too complex for the optimisation to
occur (which may itself be, if not a bug, an optimisation wart)
so it remains as CURLYX. Presumably it is in the differing
implementation of CURLYN and CURLYX that the difference arises,
but this isn't something I have time to look into right now.

The results may be reasonable nonetheless - we could in principle
stick with "the $<n> variables will contain the last thing
successfully matched", while adding that "optional zero-length
submatches (that don't affect success or failure of the match as
a whole) may be elided by the optimiser". Which I suspect is what
we're getting, even though the evidence is that the less optimised
variant is the one doing the eliding.

Hugo

@p5pRT
Author

p5pRT commented Jan 3, 2006

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jan 3, 2006

From @Abigail

On Tue, Jan 03, 2006 at 04:27:40AM +0000, hv@crypt.org wrote:

> Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
> :Consider the following program:
> :
> : $str = 'aaA';
> : $str =~ /(((?:a))?)+/i;
> : if(defined($2)) { print "$2"; }
> : else { print "not defined"; }
> :
> :This prints "not defined," and I think that's right.
> :But if I change the regex to /(((a))?)+/i (that is, if
> :I change the third group from non-capturing to capturing),
> :the program prints "A".
> :
> :I can't think of a reason why changing group 3 from
> :non-capturing to capturing should have any effect on
> :whether group 2 captures anything. Seems like a regex
> :bug to me.
>
> I agree the inconsistency smells like a bug, though it isn't clear
> to me which variant exhibits it - both results seem reasonable in
> the absence of the other.
>
> -Dr output shows that the two regexps are optimised differently:
> with /(((?:a))?)+/, the $2 loop is optimised to CURLYN, but with
> /(((a))?)+/ the interior is too complex for the optimisation to
> occur (which may itself be, if not a bug, an optimisation wart)
> so it remains as CURLYX. Presumably it is in the differing
> implementation of CURLYN and CURLYX that the difference arises,
> but this isn't something I have time to look into right now.
>
> The results may be reasonable nonetheless - we could in principle
> stick with "the $<n> variables will contain the last thing
> successfully matched", while adding that "optional zero-length
> submatches (that don't affect success or failure of the match as
> a whole) may be elided by the optimiser". Which I suspect is what
> we're getting, even though the evidence is that the less optimised
> variant is the one doing the eliding.

The following program suggests that in both regexes, the outer set of
parentheses is matched four times:

  #!/usr/bin/perl

  use strict;
  use warnings;
  no warnings 'syntax';

  $_ = 'aaA';
  my ($i, $j);
  /(((?:a))?(?{ $i ++; print "$i: $2\n" }))+/i;
  /(((a))?(?{ $j ++; print "$j: $2\n" }))+/i;

  __END__

  1: a
  2: a
  3: A
  Use of uninitialized value in concatenation (.) or string at (re_eval 1) line 1.
  4:
  1: a
  2: a
  3: A
  4: A

Abigail

@p5pRT
Author

p5pRT commented Jan 3, 2006

From @ysth

On Tue, Jan 03, 2006 at 04:27:40AM +0000, hv@crypt.org wrote:

> Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
> :Consider the following program:
> :
> : $str = 'aaA';
> : $str =~ /(((?:a))?)+/i;
> : if(defined($2)) { print "$2"; }
> : else { print "not defined"; }
> :
> :This prints "not defined," and I think that's right.
> :But if I change the regex to /(((a))?)+/i (that is, if
> :I change the third group from non-capturing to capturing),
> :the program prints "A".
> :
> :I can't think of a reason why changing group 3 from
> :non-capturing to capturing should have any effect on
> :whether group 2 captures anything. Seems like a regex
> :bug to me.
>
> I agree the inconsistency smells like a bug, though it isn't clear
> to me which variant exhibits it - both results seem reasonable in
> the absence of the other.

To me, it seems clear that on the last iteration of the +, the ?
should match zero times, so $2 would be "" with the ?: and undefined
without the ?:.

@p5pRT
Author

p5pRT commented Jan 4, 2006

From @Abigail

On Tue, Jan 03, 2006 at 03:59:19PM -0800, Yitzchak Scott-Thoennes wrote:

> On Tue, Jan 03, 2006 at 04:27:40AM +0000, hv@crypt.org wrote:
>
>> Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
>> :Consider the following program:
>> :
>> : $str = 'aaA';
>> : $str =~ /(((?:a))?)+/i;
>> : if(defined($2)) { print "$2"; }
>> : else { print "not defined"; }
>> :
>> :This prints "not defined," and I think that's right.
>> :But if I change the regex to /(((a))?)+/i (that is, if
>> :I change the third group from non-capturing to capturing),
>> :the program prints "A".
>> :
>> :I can't think of a reason why changing group 3 from
>> :non-capturing to capturing should have any effect on
>> :whether group 2 captures anything. Seems like a regex
>> :bug to me.
>>
>> I agree the inconsistency smells like a bug, though it isn't clear
>> to me which variant exhibits it - both results seem reasonable in
>> the absence of the other.
>
> To me, it seems clear that on the last iteration of the +, the ?
> should match zero times, so $2 would be "" with the ?: and undefined
> without the ?:.

That I don't understand. Since the ?: controls whether or not there's
a $3, why should the value of $2 be different?

Abigail

@p5pRT
Author

p5pRT commented Jan 4, 2006

From @ysth

On Wed, Jan 04, 2006 at 09:48:14AM +0100, Abigail wrote:

> On Tue, Jan 03, 2006 at 03:59:19PM -0800, Yitzchak Scott-Thoennes wrote:
>
>> On Tue, Jan 03, 2006 at 04:27:40AM +0000, hv@crypt.org wrote:
>>
>>> Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
>>> :Consider the following program:
>>> :
>>> : $str = 'aaA';
>>> : $str =~ /(((?:a))?)+/i;
>>> : if(defined($2)) { print "$2"; }
>>> : else { print "not defined"; }
>>> :
>>> :This prints "not defined," and I think that's right.
>>> :But if I change the regex to /(((a))?)+/i (that is, if
>>> :I change the third group from non-capturing to capturing),
>>> :the program prints "A".
>>> :
>>> :I can't think of a reason why changing group 3 from
>>> :non-capturing to capturing should have any effect on
>>> :whether group 2 captures anything. Seems like a regex
>>> :bug to me.
>>>
>>> I agree the inconsistency smells like a bug, though it isn't clear
>>> to me which variant exhibits it - both results seem reasonable in
>>> the absence of the other.
>>
>> To me, it seems clear that on the last iteration of the +, the ?
>> should match zero times, so $2 would be "" with the ?: and undefined
>> without the ?:.
>
> That I don't understand. Since the ?: controls whether or not there's
> a $3, why should the value of $2 be different?

Sorry, I was somehow assigning numbers from the inside out instead of
left to right. It should be undef in either case.

@p5pRT
Author

p5pRT commented Jan 7, 2006

From eric.niebler@gmail.com

Yitzchak Scott-Thoennes wrote:

> On Wed, Jan 04, 2006 at 09:48:14AM +0100, Abigail wrote:
>
>> On Tue, Jan 03, 2006 at 03:59:19PM -0800, Yitzchak Scott-Thoennes wrote:
>>
>>> On Tue, Jan 03, 2006 at 04:27:40AM +0000, hv@crypt.org wrote:
>>>
>>>> Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
>>>> :Consider the following program:
>>>> :
>>>> : $str = 'aaA';
>>>> : $str =~ /(((?:a))?)+/i;
>>>> : if(defined($2)) { print "$2"; }
>>>> : else { print "not defined"; }
>>>> :
>>>> :This prints "not defined," and I think that's right.
>>>> :But if I change the regex to /(((a))?)+/i (that is, if
>>>> :I change the third group from non-capturing to capturing),
>>>> :the program prints "A".
>>>> :
>>>> :I can't think of a reason why changing group 3 from
>>>> :non-capturing to capturing should have any effect on
>>>> :whether group 2 captures anything. Seems like a regex
>>>> :bug to me.
>>>>
>>>> I agree the inconsistency smells like a bug, though it isn't clear
>>>> to me which variant exhibits it - both results seem reasonable in
>>>> the absence of the other.
>>>
>>> To me, it seems clear that on the last iteration of the +, the ?
>>> should match zero times, so $2 would be "" with the ?: and undefined
>>> without the ?:.
>>
>> That I don't understand. Since the ?: controls whether or not there's
>> a $3, why should the value of $2 be different?
>
> Sorry, I was somehow assigning numbers from the inside out instead of
> left to right. It should be undef in either case.

There appears to be general agreement that this is a bug. But will it
get fixed? What happens next? (Sorry, I'm not familiar with this process.)

Eric

@p5pRT
Author

p5pRT commented Jan 9, 2006

From @hvds

Eric Niebler <eric.niebler@gmail.com> wrote:
[...]
:>>>>Eric Niebler (via RT) <perlbug-followup@perl.org> wrote:
:>>>>:Consider the following program:
:>>>>:
:>>>>: $str = 'aaA';
:>>>>: $str =~ /(((?:a))?)+/i;
:>>>>: if(defined($2)) { print "$2"; }
:>>>>: else { print "not defined"; }
:>>>>:
:>>>>:This prints "not defined," and I think that's right.
:>>>>:But if I change the regex to /(((a))?)+/i (that is, if
:>>>>:I change the third group from non-capturing to capturing),
:>>>>:the program prints "A".
[...]
:There appears to be general agreement that this is a bug. But will it
:get fixed? What happens next? (Sorry, I'm not familiar with this process.)

Now it waits until someone simultaneously acquires the time, ability and
desire to locate the bug; once located, it may be found to be anything
from easy to impossible to develop a fix that doesn't break anything else.

If a fix is developed it will go into the "bleeding edge" codebase first,
which is the one working towards v5.10 of perl; if it is stable there
and does not appear to have a wider impact it will likely also be
incorporated into the maintenance track used to deliver v5.8.x releases.

But there are few people with the knowledge to debug problems in the
regexp engine, and they tend to have limited time available, so the
first step may take a while.

Hugo

@p5pRT
Author

p5pRT commented Jun 24, 2007

From @cpansprout

perl -MData::Dumper -le '"aba" =~ /^(a(b)?)+$/; print Dumper $1, $2;'
$VAR1 = 'a';
$VAR2 = undef;

This is the case because the outer + makes the subexpression
containing the second pair of capturing parentheses match twice. The
second time through, (b) does not participate in the match, so $2 is
undef (this coincides with ECMAScript's behaviour).
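
The per-iteration reading described above can be illustrated by unrolling the outer + by hand. This is only a sketch of that reading, not of how the regex engine works internally: each pass of the loop matches one `a(b)?` unit anchored with `\G`, so no quantified group is nested inside another, and the variable name `@per_iteration` is made up for the example.

```perl
use strict;
use warnings;

my $str = 'aba';
my @per_iteration;

# Each successful match consumes one "a" optionally followed by "b";
# \G anchors it where the previous iteration stopped.
while ($str =~ /\Ga(b)?/g) {
    push @per_iteration, $1;    # inner group as seen in this iteration only
}

print join(', ', map { defined $_ ? "'$_'" : 'undef' } @per_iteration), "\n";
# prints: 'b', undef
```

Under this reading, the value left in the inner group after the final iteration is undef, which matches the `(b)` result shown above.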

But if I change (b) to (b+) or ((b)), the behaviour changes:

perl -MData::Dumper -le '"aba" =~ /^(a(b+)?)+$/; print Dumper $1, $2;'
$VAR1 = 'a';
$VAR2 = 'b';

perl -MData::Dumper -le '"aba" =~ /^(a((b))?)+$/; print Dumper $1, $2;'
$VAR1 = 'a';
$VAR2 = 'b';

(Though this probably makes no difference, if this is to be made
consistent, I think I prefer the former behaviour [!defined $2]).

This is the case both with 5.8.8 and 5.9.5 #31441.

$s = "Juusstt aannootthheerr Peerrll hhaacckkeerr,\n";
$s =~ s/(?:((?<!$_)$_)?){2}(?:((?<!$_$_)$_+)?){2}/$1$2/g for 'a' .. 'z';
print $s;


Flags​:
  category=
  severity=


Site configuration information for perl v5.8.8​:

Configured by neo at Tue Jan 9 16​:06​:53 PST 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration​:
  Platform​:
  osname=darwin, osvers=8.8.0, archname=darwin-thread-multi-2level
  uname='darwin treebeard.local 8.8.0 darwin kernel version 8.8.0​:
fri sep 8 17​:18​:57 pdt 2006; root​:xnu-792.12.6.obj~1release_ppc power
macintosh powerpc '
  config_args=''
  hint=recommended, useposix=true, d_sigaction=define
  usethreads=define use5005threads=undef useithreads=define
usemultiplicity=define
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=undef use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-g -pipe -fno-common -DPERL_DARWIN -no-cpp-
precomp -fno-strict-aliasing -I/usr/local/include',
  optimize='-O3',
  cppflags='-no-cpp-precomp -g -pipe -fno-common -DPERL_DARWIN -no-
cpp-precomp -fno-strict-aliasing -I/usr/local/include'
  ccversion='', gccversion='4.0.0 20041026 (Apple Computer, Inc.
build 4061)', gccosandvers='darwin8'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=4, nvtype='double', nvsize=8,
Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -L/usr/
local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-ldbm -ldl -lm -lc
  perllibs=-ldl -lm -lc
  libc=, so=dylib, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/
usr/local/lib'

Locally applied patches​:


@​INC for perl v5.8.8​:
  /usr/local/lib/perl5/5.8.8/darwin-thread-multi-2level
  /usr/local/lib/perl5/5.8.8
  /usr/local/lib/perl5/site_perl/5.8.8/darwin-thread-multi-2level
  /usr/local/lib/perl5/site_perl/5.8.8
  /usr/local/lib/perl5/site_perl
  /System/Library/Perl/5.8.6/darwin-thread-multi-2level
  /System/Library/Perl/5.8.6/darwin-thread-multi-2level
  /System/Library/Perl/5.8.6
  /Library/Perl/5.8.6/darwin-thread-multi-2level
  /Library/Perl/5.8.6/darwin-thread-multi-2level
  /Library/Perl/5.8.6
  /Library/Perl
  /Network/Library/Perl/5.8.6/darwin-thread-multi-2level
  /Network/Library/Perl/5.8.6
  /Network/Library/Perl
  /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level
  /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level
  /System/Library/Perl/Extras/5.8.6
  /Library/Perl/5.8.1
  .


Environment for perl v5.8.8​:
  DYLD_LIBRARY_PATH (unset)
  HOME=/Users/neo
  LANG (unset)
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/bin​:/sbin​:/usr/bin​:/usr/sbin​:/usr/TeX/bin/powerpc-
darwin6.8​:/usr/local/bin
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT
Author

p5pRT commented May 24, 2008

From p5p@spam.wizbit.be

Attached is a patch with a todo test for this bug report.

Summary of the report:

#!/usr/bin/perl -l

if ("A" =~ /(((?:A))?)+/) {
  print "\$1 = $1, \$2 = $2, \$3 = $3";
}

if ("A" =~ /(((A))?)+/) {
  print "\$1 = $1, \$2 = $2, \$3 = $3";
}
__END__
Output:

$1 = , $2 = , $3 =
$1 = , $2 = A, $3 = A

The value of the second capture group depends on whether or not there is a
third capturing group.

The value should be the same in both cases.

(For more info look at RT)
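
For anyone who wants to run the same check outside the core test suite, a rough standalone equivalent using Test::More might look like the sketch below. It is an approximation written for this summary, not the attached patch: the test description string is made up, and `is` stands in for the `iseq` helper that the core test file provides.

```perl
use strict;
use warnings;
use Test::More tests => 1;

TODO: {
    # Marked TODO so the script exits cleanly whether or not the
    # perl being tested still shows the inconsistency.
    local $TODO = '[perl #38133]';

    "A" =~ /(((?:A))?)+/;
    my $first = $2;     # $2 with a non-capturing inner group

    "A" =~ /(((A))?)+/;
    my $second = $2;    # $2 with a capturing inner group

    is($first, $second, '$2 is the same with and without a third capturing group');
}
```

On a perl where the two values differ, the harness reports a failing TODO test; where they agree, it reports an unexpectedly passing TODO test.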

@p5pRT
Author

p5pRT commented May 24, 2008

From p5p@spam.wizbit.be

Inline Patch
--- old/t/op/pat.t	2008-05-24 23:15:39.000000000 +0200
+++ new/t/op/pat.t	2008-05-24 23:16:15.000000000 +0200
@@ -4642,6 +4642,17 @@
     iseq( join('', @isPunctLatin1), '', 
 	'IsPunct agrees with [:punct:] with explicit Latin1');
 } 
+{
+  local $TODO = "[perl #38133]";
+
+  "A" =~ /(((?:A))?)+/;
+  my $first = $2;
+
+  "A" =~ /(((A))?)+/;
+  my $second = $2;
+
+  iseq($first, $second);
+}
 
 
 # Test counter is at bottom of file. Put new tests above here.
@@ -4705,7 +4716,7 @@
 
 # Don't forget to update this!
 BEGIN {
-    $::TestCount = 4035;
+    $::TestCount = 4036;
     print "1..$::TestCount\n";
 }
 

@p5pRT
Author

p5pRT commented Dec 12, 2010

From @khwilliamson

Commit 72aa120
adds the attached todo .t patch to re/pat.t
--Karl Williamson

@demerphq
Collaborator

demerphq commented Jan 6, 2023

see also #19615

@demerphq demerphq linked a pull request Jan 7, 2023 that will close this issue
@demerphq
Collaborator

demerphq commented Jan 7, 2023

This is fixed in #20677
