regular expressions matching lines read from an in-memory scalar is extremely slow in cygwin/MSYS2 perls #21877
Comments
Related PerlMonks discussion: https://www.perlmonks.org/?node_id=11157166
@dxter-zz Any chance you could modify your script and rerun it to separately measure the time taken for:
and
Right now, the script's timing output conflates the two operations, and it's not clear how the time taken is distributed across the latter two while loops. For example, could this be a memory allocation/copy slowdown rather than something in the regex engine?
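A minimal sketch of what timing the phases separately could look like. This is not the reporter's script: the generated data, the `/match/` pattern, and the variable names are hypothetical stand-ins, and only the core `Time::HiRes` module is assumed.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Phase 1: populate the in-memory "file".
# Hypothetical data: every sixth line contains the word "match".
my $t0 = time;
my $s  = join '', map { ($_ % 6 ? "plain line $_" : "match me $_") . "\n" } 1 .. 20_000;
my $t_populate = time - $t0;

# Phase 2: read every line from the scalar without running a regex.
open my $fh, '<', \$s or die "open failed: $!";
$t0 = time;
my $lines = 0;
while (<$fh>) {
    $lines++;
}
my $t_read = time - $t0;
close $fh;

# Phase 3: read every line again and run the regex on each one.
open $fh, '<', \$s or die "open failed: $!";
$t0 = time;
my $hits = 0;
while (<$fh>) {
    $hits++ if /match/;
}
my $t_match = time - $t0;
close $fh;

printf "populate:   %.3fs\n", $t_populate;
printf "read only:  %.3fs (%d lines)\n", $t_read, $lines;
printf "read+match: %.3fs (%d matching lines)\n", $t_match, $hits;
```

If the read-only phase is fast and the read-and-match phase is slow, the cost sits in what happens on a successful match rather than in populating or reading the scalar.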
I'm not sure I get your meaning, but populating $s takes 0.14 seconds on my machine.
I suspect part of the problem is the SvLEN() for The long SvLEN() values appear to be from this code in sv_gets():
Profiling shows sv_setsv_cow() as the hot function, but if I do line-by-line profiling the two hottest lines appear to be:
which doesn't make much sense, CowREFCNT(sv) is pretty much But I don't expect sv_setsv_cow() to be expensive in any case. That CowREFCNT() macro is one of the few places referencing SvLEN() (aka All I can think of is Windows deciding to commit (probably filling with zeroes) all of the large allocation done earlier by sv_gets() when that last byte is accessed.
In case it wasn't clear and if it's helpful, the large time difference only seems to occur when there are lines in the file that match the regex.
It was clear and useful: the regexp engine (conditionally) calls sv_setsv_cow() on a successful match to save the string for later use in reporting matched text, eg with At this point I see three issues:
I don't know that we can do anything about 3., the generated assembler seems reasonable:
but I suspect fixing the other two will mitigate it. COWing the large SVs causes problems outside of this case too, the SV buffer was propagated to other copies, While debugging I tried saving the SVs to an array, which resulted in:
2 could be simply fixed by calling SvPV_shrink_to_cur()[1] or something similar at the end of sv_gets() but this is both an extra call to realloc() and if the same SV doesn't get COWed, will result in extra calls to re-expand the buffer on the next sv_gets(). So I think a more complex fix is needed.

I suspect the fix for 1 is to make SvCANCOW() check that the length isn't excessive in comparison to the string size, though we will need to define "excessive" for this case.

[1] SvPV_shrink_to_cur() disables COW on the SV, which we don't want.
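To see how SvLEN() compares with SvCUR() for lines read from an in-memory scalar, one option is the core `B` module, whose `B::PV` objects expose `CUR` and `LEN`. The sketch below is only an illustration: the sample data and the `@report` array are hypothetical, and the LEN values it prints depend on the perl build and platform, so no particular output is claimed.

```perl
use strict;
use warnings;
use B ();

# Hypothetical sample data standing in for the reporter's file.
my $s = join '', map { "line $_\n" } 1 .. 5;

open my $fh, '<', \$s or die "open failed: $!";

my @report;
while (<$fh>) {
    # Inspect the SV behind $_ right after readline has filled it.
    my $sv = B::svref_2object(\$_);
    push @report, [ $sv->CUR, $sv->LEN ];   # used bytes, allocated bytes
}
close $fh;

printf "CUR=%d LEN=%d\n", @$_ for @report;
```

A LEN much larger than CUR on these lines would be consistent with the oversized buffers described above; `Devel::Peek::Dump` shows the same fields along with the SV flags.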
Previously if you had a successful match against an SV with a SvLEN() large relative to the SvCUR() the regexp engine would use sv_setsv_cow() to make a COW copy of the matched SV, extending the life of the large allocation buffer.

A normal sv_setsv() normally didn't do such a COW copy, but the above also marked the source SV as COW, so further copies of the SV could even further extend the lifetime of the buffer, eg:

    while (<>) {         # readline tends to make large SvLEN()
        /something/;     # some sort of match
        push @save, $_;  # with a successful match, the large $_ buffer
                         # survives until @save is released
    }

Fixes part of Perl#21877
Description
With cygwin/MSYS2 perl, when you do regular expression matching on lines read from an in-memory scalar and the regex matches something, it takes orders of magnitude longer than matching the same lines read from a disk file.
Steps to Reproduce
Below is code to demonstrate the behavior. An example file is attached to run with the code. In this file about 16% of the lines match.
On my cygwin system this prints:
So the in-memory file is about 300 times slower.
Expected behavior
I would expect the times to be roughly in the same ballpark.
Perl configuration