-
Notifications
You must be signed in to change notification settings - Fork 577
regex performance: braces { } slower than star * #14855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
From @maukeCreated by @mauke#!/usr/bin/perl use Benchmark qw(cmpthese); my $target = 'abcdefg' x 1_000; cmpthese -3, { __END__ The only difference between the regexes is that one uses * notation to express | Rate brace star If I'm reading this right, * runs 3500 times faster than { } for no good Perl Info
|
From [email protected]* l.mai@web.de (perlbug-followup@perl.org) [150814 15:35]:
Interesting... the brace version shows quadratic behavior 'abcdefg' x 1000 (on my system) 'abcdefg' x 500 (on my system) Rate brace star 'abcdefg' x 250 (on my system) Rate brace star So, halving the size of the pattern, make .* run double as fast, MarkOv Mark Overmeer MSc MARKOV Solutions |
The RT System itself - Status changed from 'new' to 'open' |
From @AbigailOn Fri, Aug 14, 2015 at 05:43:00PM +0200, Mark Overmeer wrote:
There's a shortcut in the case of /.*/. In the /.*/ case, the regexp engine only tries to match the string Not so in the /.{0,}/ case. Then it will restart trying to match, Also note, /.*/ is NOT the same as /.{0,}/; at least, not internally. I do agree with the OP though that there should not be a user Abigail |
From @iabynOn Fri, Aug 14, 2015 at 06:36:33PM +0200, Abigail wrote:
/.*XXX/ is a silly thing to do. Perl spots this and optimises it. -- |
From @maukeOn Fri Aug 14 10:26:40 2015, davem wrote:
I've attached a patch attempt that fixes the reported issue. It doesn't seem to break any core tests. However, I haven't actually looked at the bigger context of the code. Someone who's seen S_regpiece before should review this - I suspect the goto logic can be simplified. |
From @mauke0001-PoC-125810.patchFrom 844df4809f00fd6c6d03369af59f3c4730ddd34d Mon Sep 17 00:00:00 2001
From: Lukas Mai <[email protected]>
Date: Fri, 14 Aug 2015 21:21:38 +0200
Subject: [PATCH] PoC [#125810]
---
regcomp.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index f08f08f..3b15a07 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -10893,6 +10893,20 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
do_curly:
if ((flags&SIMPLE)) {
+ if (min == 0 && max == REG_INFTY) {
+ reginsert(pRExC_state, STAR, ret, depth+1);
+ ret->flags = 0;
+ MARK_NAUGHTY(4);
+ RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
+ goto nest_check;
+ }
+ if (min == 1 && max == REG_INFTY) {
+ reginsert(pRExC_state, PLUS, ret, depth+1);
+ ret->flags = 0;
+ MARK_NAUGHTY(3);
+ RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
+ goto nest_check;
+ }
MARK_NAUGHTY_EXP(2, 2);
reginsert(pRExC_state, CURLY, ret, depth+1);
Set_Node_Offset(ret, parse_start+1); /* MJD */
@@ -10966,22 +10980,10 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
*flagp = (op != '+') ? (WORST|SPSTART|HASWIDTH) : (WORST|HASWIDTH);
- if (op == '*' && (flags&SIMPLE)) {
- reginsert(pRExC_state, STAR, ret, depth+1);
- ret->flags = 0;
- MARK_NAUGHTY(4);
- RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
- }
- else if (op == '*') {
+ if (op == '*') {
min = 0;
goto do_curly;
}
- else if (op == '+' && (flags&SIMPLE)) {
- reginsert(pRExC_state, PLUS, ret, depth+1);
- ret->flags = 0;
- MARK_NAUGHTY(3);
- RExC_seen |= REG_UNBOUNDED_QUANTIFIER_SEEN;
- }
else if (op == '+') {
min = 1;
goto do_curly;
--
2.5.0
|
From @AbigailOn Fri, Aug 14, 2015 at 06:26:05PM +0100, Dave Mitchell wrote:
Why is /.*XXX/ a silly thing to do?
I don't know, but it would be nice if * really was just a shortcut for {0,}. And although I don't know how Perl spots things and optimizes it, it seems I had never realized before that {0,} could be slower than *. I've to Abigail |
From @janduboisOn Fri, Aug 14, 2015 at 1:37 PM, Abigail <abigail@abigail.be> wrote:
Because the .* doesn't serve any useful purpose. It's like writing You rely on the computer to optimize it away, but there is no reason How about /^.*XXX.*$/ for even more silliness. :) Cheers, |
From @AbigailOn Fri, Aug 14, 2015 at 02:13:49PM -0700, Jan Dubois wrote:
Except that it isn't. .* isn't a noop; it forces the XXX part to match at the last possible There is the same performance difference in /(.*)[xy]/ vs /(.{0,})[xy]/.
That's something different as well. Abigail |
From [email protected]Dave Mitchell writes:
So what is the non-silly thing to "greedily match anything, but only so Regards, Factory and User Sound Singles for Waldorf Q+, Q and microQ: |
From @khwilliamsonOn 08/14/2015 01:25 PM, l.mai@web.de via RT wrote:
Your patch looks good to me. It's now smoking as smoke-me/khw-star (v5.23.1-197-gcfa19b0): |
From @ikegamiOn Fri, Aug 14, 2015 at 4:37 PM, Abigail <abigail@abigail.be> wrote:
Because it's effectively /^.*?.*XXX/ it should have been written /^.*XXX/ |
From @khwilliamsonOn 08/15/2015 01:49 PM, Karl Williamson wrote:
I will apply your patch, but it needs a better commit message than "PoC" |
From @maukeOn Sun Aug 23 21:28:04 2015, public@khwilliamson.com wrote:
I've applied it myself in 02853b7. |
From @arcAbigail <abigail@abigail.be> wrote:
This is somewhat at a tangent to the rest of this thread, but as of -- |
@mauke - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#125810 (status was 'resolved')
Searchable as RT125810$
The text was updated successfully, but these errors were encountered: