Skip to content

[win32] Both IO and global match are VERY SLOW since 5.12 and up to 5.38 unless pre-heated #21654

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
LorenzoTa opened this issue Nov 21, 2023 · 9 comments · Fixed by #21664
Closed

Comments

@LorenzoTa
Copy link

Hello,

in this thread at perlmonks an anonymous contributor spotted a severe performance drop in perl under windows for both IO and global match.

I have tested the code against different strawberry distros in my answer and seems something starting getting worst after 5.12 and still present in 5.38

Pre-heating (predimensionate the array) seems to mitigate the issue after 5.26

Someone in the chat supposed to be something related to malloc on win32.

I'm not in the position to dig it up further.

Thanks for looking and, if something is found, I'll take care to update the above thread.

L*

@LorenzoTa LorenzoTa changed the title [win32] Both IO and global match are VERY SLOW. Unless pre-heated [win32] Both IO and global match are VERY SLOW since 5.12 and up to 5.38 unless pre-heated Nov 21, 2023
@mauke
Copy link
Contributor

mauke commented Nov 21, 2023

Copied from the link:

I did "physical file IO", not "scalar IO", to better demonstrate how it can affect (and always did?) the very basic daily operations. "Preheat" immediately disposes of allocated array, it's not well-known "pre-set $#array or hash size for better performance".

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';

if ( !@ARGV ) {
    say $^V;
    for my $size ( 5e5, 1e6, 5e6 ) {
        say "Array size: $size";
        for my $method ( 0, 1, 2 ) {
            for my $heat ( 0, 1 ) {
                system $^X, $0, $method, $heat, $size
            }
        }
    }
}
else {
    my ( $method, $preheat, $size ) = @ARGV;

    my $s = join ' ', 0 .. 9;
    $s .= "\n";
    $s x= $size;
    chomp $s;
    
    my $t = time;
    
    if ( $preheat ) {
        my @garbage = ( undef ) x $size
    }
    
    my @a;
    if ( $method == 0 ) {               # split
        @a = split /(?<=\n)/, $s
    }
    elsif ( $method == 1 ) {            # global match
        @a = $s =~ /(.*?\n|.+)/gs
    }
    elsif ( $method == 2 ) {            # IO (list context)
        open my $fh, '>', 'garbage.tmp';
        binmode $fh;
        print $fh $s;
        close $fh;
        open $fh, '<', 'garbage.tmp';
        binmode $fh;
        @a = <$fh>;
        close $fh;
    }
    
    printf "\t%s, %s:\t%.3f\n", 
        ( <split match io(list)> )[ $method ],
        ( $preheat ? 'pre-heat' : 'no pre-heat' ),
        time - $t
}

Result:

v5.38.0
Array size: 500000
        split, no pre-heat:     0.627
        split, pre-heat:        0.631
        match, no pre-heat:     0.855
        match, pre-heat:        0.292
        io(list), no pre-heat:  0.893
        io(list), pre-heat:     0.274
Array size: 1000000
        split, no pre-heat:     1.272
        split, pre-heat:        1.286
        match, no pre-heat:     3.604
        match, pre-heat:        0.583
        io(list), no pre-heat:  3.498
        io(list), pre-heat:     0.556
Array size: 5000000
        split, no pre-heat:     6.356
        split, pre-heat:        6.346
        match, no pre-heat:     79.586
        match, pre-heat:        2.885
        io(list), no pre-heat:  84.150
        io(list), pre-heat:     2.744

@hvds
Copy link
Contributor

hvds commented Nov 21, 2023

I'll just observe that "preheat" only touches malloc, not the array:

    if ( $preheat ) {
        my @garbage = ( undef ) x $size
    }

So it seems the problem must be in malloc or our use of it.

@richardleach
Copy link
Contributor

Looking at @LorenzoTa's results in the PerlMonks thread, notwithstanding the pre-heating behaviour, it looks like significant performance differences between 32 & 64 bit builds. For example, the 32-bit and 64-bit no pre-heat results below look consistent across perl versions:

 | v5.28.0
 | Array size: 5000000
 |      split, no pre-heat:     4.599
 |      match, no pre-heat:     20.415
 |      io(list), no pre-heat:  20.736
[OK] \perl5.28.32bit\perl\bin\perl.exe

 | v5.28.1
 | Array size: 5000000
 |      split, no pre-heat:     4.033
 |      match, no pre-heat:     78.298
 |      io(list), no pre-heat:  79.609
[OK] \perl5.28-64bit\perl\bin\perl.exe

@tonycoz
Copy link
Contributor

tonycoz commented Nov 22, 2023

I'll just observe that "preheat" only touches malloc, not the array:

    if ( $preheat ) {
        my @garbage = ( undef ) x $size
    }

So it seems the problem must be in malloc or our use of it.

realloc() on Windows of large blocks is slow relative to Linux and eg. FreeBSD, since those can use mremap() to move such blocks cheaply.

This code would EXTEND() the stack and prevent all those reallocs when incrementally growing the stack.

@demerphq
Copy link
Collaborator

This code would EXTEND() the stack and prevent all those reallocs when incrementally growing the stack.

IIRC the stack starts off quite small, we could easily change it to be much larger at startup. (Note to casual readers, this is one of the Perl stacks, not the C stack)

@tonycoz
Copy link
Contributor

tonycoz commented Nov 22, 2023

A slower realloc() (O(size of allocation) vs ~O(1) for large sizes) explains why it's slower on Windows than Linux, but doesn't explain the apparent slow down over releases.

From what I can see EXTEND() ends up calling av_extend() which via av_extend_guts() exponentially increases the stack by 20% each time which hasn't changed since 5.000 so I wouldn't expect the growth in time we see here.

The ratios between the times:

Count Time Count ratio vs base Time ratio vs base
50000 0.893 1.0 1.0
1000000 3.498 2.0 3.917
5000000 84.15 10.0 94.233

do imply we have $O(n^2)$ growth here, so I've likely missed something.

If we are seeing $O(n^2)$ growth here then changing the initial size of the stack won't make much difference unless we make it very large, since most of the cost is at the larger sizes.

I plan to do some tests today.

@tonycoz
Copy link
Contributor

tonycoz commented Nov 23, 2023

It does seem to spend most of the time in realloc():
image
image

@tonycoz
Copy link
Contributor

tonycoz commented Nov 23, 2023

And it is the tmps stack, which grows linearly:

SSize_t
Perl_tmps_grow_p(pTHX_ SSize_t ix)
{
    SSize_t extend_to = ix;
#ifndef STRESS_REALLOC
    if (ix - PL_tmps_max < 128)
        extend_to += (PL_tmps_max < 512) ? 128 : 512;
#endif
    Renew(PL_tmps_stack, extend_to + 1, SV*);
    PL_tmps_max = extend_to + 1;
    return ix;
}

After making tmps growth exponential:

v5.39.6
Array size: 500000
        split, no pre-heat:     1.723
        split, pre-heat:        1.751
        match, no pre-heat:     0.694
        match, pre-heat:        0.630
        io(list), no pre-heat:  0.455
        io(list), pre-heat:     0.418
Array size: 1000000
        split, no pre-heat:     3.523
        split, pre-heat:        3.477
        match, no pre-heat:     1.266
        match, pre-heat:        1.261
        io(list), no pre-heat:  0.821
        io(list), pre-heat:     0.843
Array size: 5000000
        split, no pre-heat:     17.279
        split, pre-heat:        17.279
        match, no pre-heat:     6.217
        match, pre-heat:        6.306
        io(list), no pre-heat:  4.126
        io(list), pre-heat:     4.184

tonycoz added a commit to tonycoz/perl5 that referenced this issue Nov 23, 2023
As with the value stack and the save stack, this gives us constant
amortized growth per element.

After this patch the profiler shows the "SvPV_shrink_to_cur(sv)"
and "sv = sv_2mortal(newSV(80))" calls in do_readline as the
hotspots for the io unheated test case, using 55% of the measured
time in total.

Fixes Perl#21654
@tonycoz
Copy link
Contributor

tonycoz commented Nov 23, 2023

With the patch the hotspot becomes:
image
Saving and restoring error/Win32 last error is a remarkable percentage of the time used for malloc():
image
image
and realloc():
image
image
image

tonycoz added a commit to tonycoz/perl5 that referenced this issue Nov 23, 2023
As with the value stack and the save stack, this gives us constant
amortized growth per element.

After this patch the profiler shows the "SvPV_shrink_to_cur(sv)"
and "sv = sv_2mortal(newSV(80))" calls in do_readline as the
hotspots for the io unheated test case, using 55% of the measured
time in total.

Fixes Perl#21654
tonycoz added a commit to tonycoz/perl5 that referenced this issue Nov 27, 2023
As with the value stack and the save stack, this gives us constant
amortized growth per element.

After this patch the profiler shows the "SvPV_shrink_to_cur(sv)"
and "sv = sv_2mortal(newSV(80))" calls in do_readline as the
hotspots for the io unheated test case, using 55% of the measured
time in total.

Fixes Perl#21654
tonycoz added a commit to tonycoz/perl5 that referenced this issue Nov 29, 2023
As with the value stack and the save stack, this gives us constant
amortized growth per element.

After this patch the profiler shows the "SvPV_shrink_to_cur(sv)"
and "sv = sv_2mortal(newSV(80))" calls in do_readline as the
hotspots for the io unheated test case, using 55% of the measured
time in total.

Fixes Perl#21654
tonycoz added a commit that referenced this issue Dec 4, 2023
As with the value stack and the save stack, this gives us constant
amortized growth per element.

After this patch the profiler shows the "SvPV_shrink_to_cur(sv)"
and "sv = sv_2mortal(newSV(80))" calls in do_readline as the
hotspots for the io unheated test case, using 55% of the measured
time in total.

Fixes #21654
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants