Assertion failure in S_find_byclass: ! is_utf8_pat #17278

Closed
dur-randir opened this issue Nov 8, 2019 · 2 comments
@dur-randir
Member

This is a bug report for perl from [email protected],
generated with the help of perlbug 1.41 running under perl 5.31.6.

[Please describe your issue here]

While fuzzing perl v5.31.5-213-g9bec17d7c built with afl and run
under libdislocator, I found the following program

BEGIN{$^H=4}
$z="q!\341\200\200\341\200\200\341\200\200\340\240\200\340\240\200\340\240\200\343\200\200\343\200\200\340\240\200\340\240\200\340\240\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200\341\200\200!=~m!(?^i)[\303\200]!";
utf8::decode($z);
eval$z;

to cause an assertion failure

regexec.c:2236: S_find_byclass: Assertion `! is_utf8_pat' failed.

GDB stack trace is

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c24535 in __GI_abort () at abort.c:79
#2  0x00007ffff7c2440f in __assert_fail_base (fmt=0x7ffff7d86ee0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555555b47a40 "! is_utf8_pat",
    file=0x555555b44a68 "regexec.c", line=2236, function=<optimized out>) at assert.c:92
#3  0x00007ffff7c32102 in __GI___assert_fail (assertion=0x555555b47a40 "! is_utf8_pat", file=0x555555b44a68 "regexec.c", line=2236,
    function=0x555555b77e70 <__PRETTY_FUNCTION__.18737> "S_find_byclass") at assert.c:101
#4  0x00005555558aaaa1 in S_find_byclass (prog=0x555555c06170, c=0x555555c072bc,
    s=0x555555bf24f0 "ကကကࠀࠀࠀ  ࠀࠀࠀ", 'က' <repeats 19 times>, strend=0x555555bf254a "", reginfo=0x7fffffffdcf0) at regexec.c:2236
#5  0x00005555558b7a86 in Perl_regexec_flags (rx=0x555555bf6ef0, stringarg=0x555555bf24f0 "ကကကࠀࠀࠀ  ࠀࠀࠀ", 'က' <repeats 19 times>,
    strend=0x555555bf254a "", strbeg=0x555555bf24f0 "ကကကࠀࠀࠀ  ࠀࠀࠀ", 'က' <repeats 19 times>, minend=0, sv=0x555555bf6f08, data=0x0,
    flags=97) at regexec.c:3732
#6  0x000055555577262d in Perl_pp_match () at pp_hot.c:3014
#7  0x0000555555713b6f in Perl_runops_debug () at dump.c:2571
#8  0x00005555555f000d in S_run_body (oldscope=1) at perl.c:2714
#9  0x00005555555ef58b in perl_run (my_perl=0x555555bda260) at perl.c:2637
#10 0x00005555555a1155 in main (argc=2, argv=0x7fffffffe1d8, env=0x7fffffffe1f0) at perlmain.c:134
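The characters shown in the stack trace are the decoded form of the octal escapes in the reproducer. As a minimal sketch (not part of the original report, variable names are illustrative), this shows what `utf8::decode` does to each distinct byte sequence used above, and that the result carries the UTF-8 flag, which is why the `eval` ends up compiling a UTF-8 encoded pattern:

```perl
use strict;
use warnings;

# Each entry is one of the byte sequences that appears in the reproducer.
my @codepoints;
for my $bytes ("\341\200\200", "\340\240\200", "\343\200\200", "\303\200") {
    my $chars = $bytes;
    # utf8::decode converts the bytes in place to characters and returns
    # false if the input was not well-formed UTF-8.
    utf8::decode($chars) or die "not valid UTF-8";
    push @codepoints, ord($chars);
    printf "%d bytes -> U+%04X (UTF-8 flag %s)\n",
        length($bytes), ord($chars), utf8::is_utf8($chars) ? "on" : "off";
}
```

The last sequence, `\303\200`, is the one inside the bracketed character class: it decodes to U+00C0.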

This is a regression between 5.28 and 5.30; bisect points to

commit b229619
Author: Karl Williamson [email protected]
Date: Tue Dec 25 22:56:48 2018 -0700

Revamp qr/[...]/ optimizations

This commit extensively changes the optimizations for ANYOF regnodes
that represent bracketed character classes.

The removal of the regex compilation pass now makes these feasible and
desirable. Compilation now tries hard to optimize an ANYOF node into
something smaller and/or faster when feasible.

Now, qr/[X]/ for any single character or POSIX class X, and any
modifiers like /d, /i, etc, should be the same as qr/X/ for the same
modifiers, unless it would require the pattern to be upgraded from
non-UTF-8 to UTF-8, unless not doing so could introduce bugs.

These changes fix some issues with multi-character /i folding.

[Please do not change anything below this line]
Flags:
category=core
severity=medium
Site configuration information for perl 5.31.6:

Configured by dur-randir at Fri Nov 8 05:18:19 MSK 2019.

Summary of my perl5 (revision 5 version 31 subversion 6) configuration:
Commit id: 1462134
Platform:
osname=darwin
osvers=13.4.0
archname=darwin-2level
uname='darwin isengard.local 13.4.0 darwin kernel version 13.4.0: mon jan 11 18:17:34 pst 2016; root:xnu-2422.115.15~1release_x86_64 x86_64 '
config_args='-de -Dusedevel -DDEBUGGING'
hint=recommended
useposix=true
d_sigaction=define
useithreads=undef
usemultiplicity=undef
use64bitint=define
use64bitall=define
uselongdouble=undef
usemymalloc=n
default_inc_excludes_dot=define
bincompat5005=undef
Compiler:
cc='cc'
ccflags ='-fno-common -DPERL_DARWIN -mmacosx-version-min=10.9 -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -I/opt/local/include -DPERL_USE_SAFE_PUTENV'
optimize='-O3 -g'
cppflags='-fno-common -DPERL_DARWIN -mmacosx-version-min=10.9 -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -I/opt/local/include'
ccversion=''
gccversion='4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)'
gccosandvers=''
intsize=4
longsize=8
ptrsize=8
doublesize=8
byteorder=12345678
doublekind=3
d_longlong=define
longlongsize=8
d_longdbl=define
longdblsize=16
longdblkind=3
ivtype='long'
ivsize=8
nvtype='double'
nvsize=8
Off_t='off_t'
lseeksize=8
alignbytes=8
prototype=define
Linker and Libraries:
ld='cc'
ldflags =' -mmacosx-version-min=10.9 -fstack-protector -L/usr/local/lib -L/opt/local/lib'
libpth=/usr/local/lib /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/6.0/lib /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib /usr/lib /opt/local/lib
libs=-lpthread -lgdbm -ldbm -ldl -lm -lutil -lc
perllibs=-lpthread -ldl -lm -lutil -lc
libc=
so=dylib
useshrplib=false
libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs
dlext=bundle
d_dlsymun=undef
ccdlflags=' '
cccdlflags=' '
lddlflags=' -mmacosx-version-min=10.9 -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib -fstack-protector'

@INC for perl 5.31.6:
lib
/usr/local/lib/perl5/site_perl/5.31.6/darwin-2level
/usr/local/lib/perl5/site_perl/5.31.6
/usr/local/lib/perl5/5.31.6/darwin-2level
/usr/local/lib/perl5/5.31.6

Environment for perl 5.31.6:
DYLD_LIBRARY_PATH (unset)
HOME=/Users/dur-randir
LANG=en_US.UTF-8
LANGUAGE (unset)
LC_CTYPE=en_US.UTF-8
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/Users/dur-randir/perlbrew/bin:/Users/dur-randir/perlbrew/perls/perl-5.26.0/bin:/opt/local/bin:/usr/texbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Library/TeX/texbin
PERLBREW_HOME=/Users/dur-randir/.perlbrew
PERLBREW_MANPATH=/Users/dur-randir/perlbrew/perls/perl-5.26.0/man
PERLBREW_PATH=/Users/dur-randir/perlbrew/bin:/Users/dur-randir/perlbrew/perls/perl-5.26.0/bin
PERLBREW_PERL=perl-5.26.0
PERLBREW_ROOT=/Users/dur-randir/perlbrew
PERLBREW_SHELLRC_VERSION=0.86
PERLBREW_VERSION=0.86
PERL_BADLANG (unset)
SHELL=/opt/local/bin/zsh

@khwilliamson
Contributor

Actually, this is a deeper problem than the blamed commit, which merely exposed the issue.

perlre says about the (?...) construct

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imnsx>.

The case in this ticket ends up using this construct, but is encoded in UTF-8. The code that does the matching does not expect that a UTF-8 pattern would do /d type matching. It would be extra work to do this, and I don't think it is worth it, given how long this problem has taken to surface.
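For context on why /d and /u differ here, a minimal sketch (an illustration of the documented charset rules, not the code path that asserts): under /d, a character in the 128-255 range gets Unicode case folding only when the target string or the pattern is UTF-8 encoded, while /u always applies it. The variable names are illustrative.

```perl
use strict;
use warnings;

my $native = "\xE0";            # a-grave as one native byte, UTF-8 flag off
my $upgraded = "\xE0";
utf8::upgrade($upgraded);       # same character, stored as UTF-8 internally

# U+00C0 (A-grave) is the character in the reproducer's bracketed class.
my $d_native   = $native   =~ /\xC0/di ? 1 : 0;   # /d, nothing is UTF-8
my $u_native   = $native   =~ /\xC0/ui ? 1 : 0;   # /u, Unicode rules always
my $d_upgraded = $upgraded =~ /\xC0/di ? 1 : 0;   # /d, but target is UTF-8

print "d/native=$d_native u/native=$u_native d/upgraded=$d_upgraded\n";
```

Only the first match fails. A UTF-8 encoded pattern therefore always gets Unicode rules, which is consistent with the matching code not expecting to see /d together with a UTF-8 pattern.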

I would like to change perlre and the compilation code so that, when the caret is encountered in a UTF-8 encoded pattern, the charset is reset to /u instead of /d.

Opinions?

@khwilliamson
Contributor

In thinking about this some more, I'm unsure what the caret should mean. At the time this was added, the caret was supposed to signify a fresh start, which is why this particular character was chosen: it typically comes at the start of patterns.

The test case in this issue actually had the pattern effectively be in the scope of 'use locale'. Should the caret mean that the fresh start is whatever is in effect outside the pattern? What if, instead of locale, it was 'use re "/a"'? Or should the caret mean what's in effect for the whole pattern, based on things like 'use locale' but also on the specific pattern's trailing modifiers?

To me it seems like the caret should merely reset whatever modifiers have been changed within the pattern to what's in effect at the start of its compilation.
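To make the "fresh start" behaviour concrete, here is a small sketch of how the caret behaves in released perls (it illustrates existing behaviour, not the change being discussed; variable names are illustrative). A `qr//` object stringifies with a leading `(?^`, and that is what insulates it from the modifiers of a pattern it is interpolated into:

```perl
use strict;
use warnings;

my $plain = qr/b/;     # no modifiers
my $fold  = qr/b/i;    # case-insensitive
my $uni   = qr/b/u;    # explicit Unicode rules

# Stringification shows the caret followed by any non-default modifiers.
print "$plain $fold $uni\n";

# Interpolated into an /i pattern, (?^:b) switches /i back off for "b" only.
my $mixed_case_hit  = "Ab" =~ /a$plain/i ? 1 : 0;
my $mixed_case_miss = "AB" =~ /a$plain/i ? 1 : 0;
print "Ab: $mixed_case_hit, AB: $mixed_case_miss\n";
```

Here the outer `a` still matches case-insensitively, while the interpolated `b` does not, so `"Ab"` matches and `"AB"` does not.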

khwilliamson added a commit that referenced this issue Oct 23, 2020
This was an assertion failure in regexec.c under rare circumstances.  A
reduction of the fuzzed test case is now in pat_advanced.t

The root cause of this was that the pattern being compiled was encoded in
UTF-8 and 'use locale' was in effect, equivalent to the /l charset, and
then the charset was reset inside the pattern, to /d.  But /d in a UTF-8
pattern is illegal, hence the later assertion failure.

The solution is to reset instead to /u when the pattern is UTF-8.
steve-m-hay pushed a commit that referenced this issue Dec 26, 2020

(cherry picked from commit bb58640)