Skip to content

Safe signals slowdown #10222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
p5pRT opened this issue Mar 10, 2010 · 12 comments
Closed

Safe signals slowdown #10222

p5pRT opened this issue Mar 10, 2010 · 12 comments

Comments

@p5pRT
Copy link

p5pRT commented Mar 10, 2010

Migrated from rt.perl.org#73480 (status was 'resolved')

Searchable as RT73480$

@p5pRT
Copy link
Author

p5pRT commented Mar 10, 2010

From @nwc10

Perl 5.8.0 introduced "safe" signals, which changed signal handling to have
a C signal handler only set a flag, and run the Perl-space signal handlers
at the next "safe" moment. The runloop​:

int
Perl_runops_standard(pTHX)
{
  dVAR;
  while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
  PERL_ASYNC_CHECK();
  }

  TAINT_NOT;
  return 0;
}

was changed to check for and dispatch pending signals between ops, by changing
the default definition of PERL_ASYNC_CHECK() from

#ifndef PERL_ASYNC_CHECK
#define PERL_ASYNC_CHECK() NOOP
#endif

to this​:

#ifndef PERL_MICRO
# ifndef PERL_ASYNC_CHECK
# define PERL_ASYNC_CHECK() if (PL_sig_pending) despatch_signals()
# endif
#endif

#ifndef PERL_ASYNC_CHECK
# define PERL_ASYNC_CHECK() NOOP
#endif

(PERL_ASYNC_CHECK was originally added for MacOS classic. Classic is now gone)

PERL_ASYNC_CHECK is also present in the restart loops of various IO ops.

At the time this was believed to cause a slowdown, and IIRC a figure given
was 4% [citation needed :-)]. However, I might be misremembering, because
running perlbench today on blead I find that it's 4%.

(Change PERL_ASYNC_CHECK to NOOP, and force unsafe signals. Quite a few
regression tests SEGV. This is "perl5.11.5-unsafe" below)

Anyway, on IRC Reini Urban was investigating unrolling the OP loop, as part of
compiling perl, and observed that on x86 there are (IIRC) 18 assembly
instructions for the PERL_ASYNC_CHECK, but only (IIRC) 12 for the OP dispatch.
So I wondered if we could remove PERL_ASYNC_CHECK from the oploop, and put it
in relevant OPs.

IIRC some time ago Leo was working on JITting the parrot, and took the
approach of not checking for events (including signals) most of the time,
but explicitly checking on backwards jumps, because backwards jumps were
loops looping.

So I tried removing PERL_ASYNC_CHECK from the oploop, and putting into
all ops that have flow control (as in, ops that potentially can return
different values for the next op), nextstate, POP_BLOCK and at the end
of unwinding the scope stack.

The last one might be overkill, but originally I didn't have POP_BLOCK,
instead checking at start and end of unwinding the scope stack, but that
fails one test in lib/warnings.t​:

########
# mg.c
use warnings 'signal' ;
if ($^O eq 'MSWin32' || $^O eq 'NetWare' || $^O eq 'VMS') {
  print "SKIPPED\n# $^O, can't kill() to raise()\n"; exit;
}
$|=1;
$SIG{"INT"} = "fred"; kill "INT",$$;
EXPECT
SIGINT handler "fred" not defined.
########

because POB_BLOCK has already restored PL_curcop before Perl_leave_scope()
is called, so the use warnings 'signal' is no longer seen.

So, with this, signal dispatch is now (effectively) at C<;>, C<}> and any
flow control. I don't know if I've missed any constructions, such that it's
possible to use them to create an uninterruptible loop. I think I've got it
"right", in that all signal dispatching occurs in the same statement (and
scope) as the signal delivery, just in case anything "cares" (like the test
above).

But, all tests pass, and perlbench says​:

$ ./perlbench-run /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-{vanilla,unsafe,safefast} /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-{vanilla,safefast,unsafe}
A) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-vanilla

B) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-unsafe

C) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-safefast

D) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-vanilla

E) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-safefast

F) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-98-g243b158/bin/perl5.11.5-unsafe

  A B C D E F
  --- --- --- --- --- ---
arith/mixed 100 103 102 99 102 102
arith/trig 100 106 105 100 104 107
array/copy 100 97 100 96 99 103
array/foreach 100 103 98 100 98 93
array/index 100 73 99 101 100 96
array/pop 100 94 99 89 101 97
array/shift 100 98 101 98 102 98
array/sort-num 100 95 96 100 97 96
array/sort 100 99 100 102 99 99
call/0arg 100 97 96 100 96 98
call/1arg 100 109 94 105 98 106
call/2arg 100 95 86 102 90 94
call/9arg 100 107 102 98 99 107
call/empty 100 90 101 103 104 90
call/fib 100 95 91 92 89 99
call/method 100 103 102 103 102 105
call/wantarray 100 104 105 102 109 103
hash/copy 100 95 99 101 98 95
hash/each 100 95 99 99 98 96
hash/foreach-sort 100 103 102 99 102 105
hash/foreach 100 102 101 100 101 103
hash/get 100 103 78 99 100 100
hash/set 100 103 102 101 102 103
loop/for-c 100 112 124 100 127 112
loop/for-range-const 100 134 126 97 131 134
loop/for-range 100 122 127 100 125 131
loop/getline 100 100 94 101 97 99
loop/while-my 100 137 135 104 133 137
loop/while 100 111 135 99 136 111
re/const 100 97 103 101 101 100
re/w 100 96 96 96 99 100
startup/fewmod 100 101 100 101 101 100
startup/lotsofsub 100 101 100 101 100 100
startup/noprog 100 100 101 100 101 100
string/base64 100 98 98 100 99 97
string/htmlparser 100 100 99 99 98 101
string/index-const 100 110 114 99 114 110
string/index-var 100 120 96 100 96 121
string/ipol 100 104 111 99 111 105
string/tr 100 99 97 100 101 99

AVERAGE 100 103 103 100 104 104

  ^ unmodified blead ^
  ^ safe signals ripped out ^
  ^ this patch ^

So it looks like this approach recovers *all* the speed lost by safe
signals. (Which surprises me, as I would have expected only a partial
improvement)

Nicholas Clark

Inline Patch
diff --git a/cop.h b/cop.h
index 13ce794..0faf0b0 100644
--- a/cop.h
+++ b/cop.h
@@ -563,12 +563,13 @@ struct block {
 		    (long)cxstack_ix, PL_block_type[CxTYPE(cx)]); )
 
 /* Exit a block (RETURN and LAST). */
-#define POPBLOCK(cx,pm) cx = &cxstack[cxstack_ix--],			\
-	newsp		 = PL_stack_base + cx->blk_oldsp,		\
-	PL_curcop	 = cx->blk_oldcop,				\
-	PL_markstack_ptr = PL_markstack + cx->blk_oldmarksp,		\
-	PL_scopestack_ix = cx->blk_oldscopesp,				\
-	pm		 = cx->blk_oldpm,				\
+#define POPBLOCK(cx,pm) cx = &cxstack[cxstack_ix--];			\
+	PERL_ASYNC_CHECK();						\
+	newsp		 = PL_stack_base + cx->blk_oldsp;		\
+	PL_curcop	 = cx->blk_oldcop;				\
+	PL_markstack_ptr = PL_markstack + cx->blk_oldmarksp;		\
+	PL_scopestack_ix = cx->blk_oldscopesp;				\
+	pm		 = cx->blk_oldpm;				\
 	gimme		 = cx->blk_gimme;				\
 	DEBUG_SCOPE("POPBLOCK");					\
 	DEBUG_l( PerlIO_printf(Perl_debug_log, "Leaving block %ld, type %s\n",		\
diff --git a/dump.c b/dump.c
index bc1ba58..d1fa26e 100644
--- a/dump.c
+++ b/dump.c
@@ -2026,7 +2026,6 @@ Perl_runops_debug(pTHX)
 
     DEBUG_l(Perl_deb(aTHX_ "Entering new RUNOPS level\n"));
     do {
-	PERL_ASYNC_CHECK();
 	if (PL_debug) {
 	    if (PL_watchaddr && (*PL_watchaddr != PL_watchok))
 		PerlIO_printf(Perl_debug_log,
diff --git a/pp_ctl.c b/pp_ctl.c
index 742bc3d..9f8a6d5 100644
--- a/pp_ctl.c
+++ b/pp_ctl.c
@@ -264,6 +264,9 @@ PP(pp_substcont)
     register REGEXP * const rx = cx->sb_rx;
     SV *nsv = NULL;
     REGEXP *old = PM_GETRE(pm);
+
+    PERL_ASYNC_CHECK();
+
     if(old != rx) {
 	if(old)
 	    ReREFCNT_dec(old);
@@ -1865,6 +1868,8 @@ PP(pp_dbstate)
     PL_stack_sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
     FREETMPS;
 
+    PERL_ASYNC_CHECK();
+
     if (PL_op->op_flags & OPf_SPECIAL /* breakpoint */
 	    || SvIV(PL_DBsingle) || SvIV(PL_DBsignal) || SvIV(PL_DBtrace))
     {
@@ -2647,6 +2652,8 @@ PP(pp_goto)
     else
 	label = cPVOP->op_pv;
 
+    PERL_ASYNC_CHECK();
+
     if (label && *label) {
 	OP *gotoprobe = NULL;
 	bool leaving_eval = FALSE;
diff --git a/pp_hot.c b/pp_hot.c
index 3371e88..9f70147 100644
--- a/pp_hot.c
+++ b/pp_hot.c
@@ -52,6 +52,7 @@ PP(pp_nextstate)
     TAINT_NOT;		/* Each statement is presumed innocent */
     PL_stack_sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
     FREETMPS;
+    PERL_ASYNC_CHECK();
     return NORMAL;
 }
 
@@ -98,6 +99,7 @@ PP(pp_gv)
 PP(pp_and)
 {
     dVAR; dSP;
+    PERL_ASYNC_CHECK();
     if (!SvTRUE(TOPs))
 	RETURN;
     else {
@@ -203,6 +205,7 @@ PP(pp_sassign)
 PP(pp_cond_expr)
 {
     dVAR; dSP;
+    PERL_ASYNC_CHECK();
     if (SvTRUEx(POPs))
 	RETURNOP(cLOGOP->op_other);
     else
@@ -416,6 +419,7 @@ PP(pp_preinc)
 PP(pp_or)
 {
     dVAR; dSP;
+    PERL_ASYNC_CHECK();
     if (SvTRUE(TOPs))
 	RETURN;
     else {
@@ -434,6 +438,7 @@ PP(pp_defined)
     const bool is_dor = (op_type == OP_DOR || op_type == OP_DORASSIGN);
 
     if (is_dor) {
+	PERL_ASYNC_CHECK();
         sv = TOPs;
         if (!sv || !SvANY(sv)) {
 	    if (op_type == OP_DOR)
@@ -2071,6 +2076,8 @@ PP(pp_subst)
 #endif
     SV *nsv = NULL;
 
+    PERL_ASYNC_CHECK();
+
     /* known replacement string? */
     register SV *dstr = (pm->op_pmflags & PMf_CONST) ? POPs : NULL;
     if (PL_op->op_flags & OPf_STACKED)
diff --git a/run.c b/run.c
index be280ee..20c711a 100644
--- a/run.c
+++ b/run.c
@@ -38,7 +38,6 @@ Perl_runops_standard(pTHX)
 {
     dVAR;
     while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
-	PERL_ASYNC_CHECK();
     }
 
     TAINT_NOT;
diff --git a/scope.c b/scope.c
index ed4c835..b9b50bf 100644
--- a/scope.c
+++ b/scope.c
@@ -1113,6 +1113,8 @@ Perl_leave_scope(pTHX_ I32 base)
     }
 
     PL_tainted = was;
+
+    PERL_ASYNC_CHECK();
 }
 
 void

@p5pRT
Copy link
Author

p5pRT commented Mar 11, 2010

From @rurban

Nicholas Clark (via RT) schrieb​:

# New Ticket Created by Nicholas Clark
# Please include the string​: [perl #73480]
# in the subject line of all future correspondence about this issue.
#<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=73480>

Perl 5.8.0 introduced "safe" signals, which changed signal handling to have
a C signal handler only set a flag, and run the Perl-space signal handlers
at the next "safe" moment. The runloop​:

int
Perl_runops_standard(pTHX)
{
dVAR;
while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
PERL_ASYNC_CHECK();
}

 TAINT\_NOT;
 return 0;

}

was changed to check for and dispatch pending signals between ops, by changing
the default definition of PERL_ASYNC_CHECK() from

#ifndef PERL_ASYNC_CHECK
#define PERL_ASYNC_CHECK() NOOP
#endif

to this​:

#ifndef PERL_MICRO
# ifndef PERL_ASYNC_CHECK
# define PERL_ASYNC_CHECK() if (PL_sig_pending) despatch_signals()
# endif
#endif

#ifndef PERL_ASYNC_CHECK
# define PERL_ASYNC_CHECK() NOOP
#endif

(PERL_ASYNC_CHECK was originally added for MacOS classic. Classic is now gone)

PERL_ASYNC_CHECK is also present in the restart loops of various IO ops.

At the time this was believed to cause a slowdown, and IIRC a figure given
was 4% [citation needed :-)]. However, I might be misremembering, because
running perlbench today on blead I find that it's 4%.

(Change PERL_ASYNC_CHECK to NOOP, and force unsafe signals. Quite a few
regression tests SEGV. This is "perl5.11.5-unsafe" below)

Anyway, on IRC Reini Urban was investigating unrolling the OP loop, as part of
compiling perl, and observed that on x86 there are (IIRC) 18 assembly
instructions for the PERL_ASYNC_CHECK, but only (IIRC) 12 for the OP dispatch.
So I wondered if we could remove PERL_ASYNC_CHECK from the oploop, and put it
in relevant OPs.

BTW​: Just for reference. Here is my updated runops jitter x86 for
threaded and un-threaded.
I don't have numbers as it still segfaults but it looks like unthreaded
is faster unjitted and only threaded can be made faster jitted, without
major work, say two man-years.

Great patch, Nick!
--
Reini Urban
http​://phpwiki.org/ http​://murbreak.at/

@p5pRT
Copy link
Author

p5pRT commented Mar 11, 2010

From @rurban

/* Jit.xs
*
* Copyright (C) 2010 by Reini Urban
*
* You may distribute under the terms of either the GNU General Public
* License or the Artistic License, as specified in the README file.
*/

#include <EXTERN.h>
#include <perl.h>
#include <XSUB.h>

#ifndef _WIN32
#include <sys/mman.h>
#endif

#define T_CHARARR static unsigned char

/*
C pseudocode

  threaded​:
  my_perl->Iop = PL_op->op_ppaddr(my_perl);
  if (my_perl->Isig_pending) Perl_despatch_signals(my_perl);

  not-threaded​:
  PL_op = PL_op->op_ppaddr();
  if (PL_sig_pending) Perl_despatch_signals();
*/

#if (defined(__i386__) || defined(_M_IX86)) && defined(USE_ITHREADS)

/*
  x86 thr​: my_perl in ebx, my_perl->Iop in eax (ebx+4)
prolog​: my_perl passed on stack, but force 16-alignment for stack. core2/opteron just love that
  8D 4C 24 04 leal 4(%esp), %ecx
  83 E4 F0 andl $-16, %esp
  FF 71 FC pushl -4(%ecx)
call_near/far​:
  89 1c 24 mov %ebx,(%esp) ; push my_perl
  e8 xx xx xx xx call $PL_op->op_ppaddr - code
  or
  FF 25 xx xx xx xx jmp $PL_op->op_ppaddr ; call far 0x5214a4c5<Perl_pp_enter>
save_plop​:
  90 nop
  90 nop
  89 43 04 mov %eax,0x4(%ebx) ; save new PL_op into my_perl
PERL_ASYNC_CHECK​:
  movl %ebx, (%esi) ;891e
  movl %eax, 4(%esi) ;894604
  movl 900(%esi), %eax ;8b8684030000
  testl %eax, %eax ;85C0
  je +8 ;7408
  movl %esi, (%esp) ;893424
  call _Perl_despatch_signals ;FF25xxxxxxxx

after calling Perl_despatch_signals, restore my_perl into ebx and push for next
  83 c4 10 add $0x10,%esp
  83 ec 0c sub $0xc,%esp
  31 db xor %ebx,%ebx
  53 push %ebx

epilog after final Perl_despatch_signals
  83 c4 10 add $0x10,%esp
  8d 65 f8 lea -0x8(%ebp),%esp
  59 pop %ecx
  5b pop %ebx
  5d pop %ebp
  8d 61 fc lea -0x4(%ecx),%esp
  c3 ret
*/

/* my_perl already on stack, but force 16-alignment for stack */
T_CHARARR x86thr_prolog[] = {0x8d,0x4c,0x24,0x04,
  0x83,0xe4,0xf0,0xff,
  0x71,0xfc};
T_CHARARR x86thr_call_near[] = {0x89,0x1c,0x24,0xE8}; /* push my_perl, call near relative */
T_CHARARR x86thr_call_far[] = {0x89,0x1c,0x24,0xFF,
  0x25}; /* push my_perl, call far $PL_op->op_ppaddr */
T_CHARARR x86thr_save_plop[] = {0x90,0x89,0x43,0x04}; /* save new PL_op into my_perl */
T_CHARARR x86_nop[] = {0x90}; /* pad */
T_CHARARR x86thr_dispatch_getsig[] = {};
T_CHARARR x86thr_dispatch[] = {0x89,0x1e,0x89,0x46,
  0x04,0x8b,0x86,0x84,
  0x03,0x00,0x00,0x85,
  0xC0,0x74,0x08,0x89,
  0x34,0x24,0xFF,0x25}; /* check and call $Perl_despatch_signals */
/* after calling Perl_despatch_signals, restore my_perl into ebx and push for next.
  restore my_perl into ebx and push */
T_CHARARR x86thr_dispatch_post[] = {0x83,0xc4,0x10,0x83,
  0xec,0x0c,0x31,0xdb,
  0x53,0x90};
/* epilog after final Perl_despatch_signals */
T_CHARARR x86thr_epilog[] = {0x83,0xc4,0x10,0x8d,
  0x65,0xf8,0x59,0x5b,
  0x5d,0x8d,0x61,0xfc,
  0xc3,0x90};

# define PROLOG x86thr_prolog
# define CALL_NEAR x86thr_call_near
# define CALL_FAR x86thr_call_far
# define NOP x86_nop
# define SAVE_PLOP x86thr_save_plop
# define DISPATCH_GETSIG x86thr_dispatch_getsig
# define DISPATCH x86thr_dispatch
# define DISPATCH_POST x86thr_dispatch_post
# define EPILOG x86thr_epilog

#endif
#if (defined(__i386__) || defined(_M_IX86)) && !defined(USE_ITHREADS)

/*
x86 not-threaded
PL_op in eax, PL_sig_pending in ebx
Note​: It looks like gcc can inline some pp calls better than the jitter.
enter/nextstate/leave are inlined pretty good.

prolog​:
  55 push %ebp
  89 e5 mov %esp,%ebp
  #83 ec 04 sub $0x4,%esp
call_near/far​:
  e8 xx xx xx xx call $PL_op->op_ppaddr - code
  or
  FF 25 xx xx xx xx jmp $PL_op->op_ppaddr ; call far
save_plop​:
  90 nop
  a3 xx xx xx xx mov %eax,$PL_op ;0x4061c4
PERL_ASYNC_CHECK​:
  a1 xx xx xx xx mov $PL_sig_pending,%eax
  85 c0 test %eax,%eax
  74 05 je +5
  e8 xx xx xx xx call Perl_despatch_signals
epilog​:
  b8 00 00 00 00 mov $0x0,%eax
  5d pop %ebp
  #c9 leave
  c3 ret
*/

T_CHARARR x86_prolog[] = {0x55,0x89,0xe5}; /* just save ebp,esp */
T_CHARARR x86_call_near[] = {0xE8};
T_CHARARR x86_call_far[] = {0xFF,0x25}; /* jmp $PL_op->op_ppaddr */
T_CHARARR x86_save_plop[] = {0xa3}; /* save new PL_op */
T_CHARARR x86_nop[] = {0x90}; /* pad */
T_CHARARR x86_dispatch_getsig[] = {0xa1};
T_CHARARR x86_dispatch[] = {0x85,0xc0,0x74,0x05,
  0xFF,0x25};
T_CHARARR x86_dispatch_post[] = {};
T_CHARARR x86_epilog[] = {0xb8,0x00,0x00,0x00,
  0x5d,0xc3,0x90,0x90};

# define PROLOG x86_prolog
# define CALL_NEAR x86_call_near
# define CALL_FAR x86_call_far
# define NOP x86_nop
# define SAVE_PLOP x86_save_plop
# define DISPATCH_GETSIG x86_dispatch_getsig
# define DISPATCH x86_dispatch
# define DISPATCH_POST x86_dispatch_post
# define EPILOG x86_epilog
#endif

/* only after io funcs? rumor says after each and every op */
#define DISPATCH_NEEDED(op) 0

/*
Faster jitted execution path without loop,
selected with -MJit or (later) with perl -j.

All ops are unrolled in execution order for the CPU cache,
prefetching is the main advantage of this function.
The ASYNC check should be done only when necessary. (TODO)

For now only implemented for x86 with certain hardcoded my_perl offsets.
*/
int
Perl_runops_jit(pTHX)
{
  dVAR;
  register int i;
  unsigned char *code, *c;

  /* quirky pass 1​: need code size to allocate string.
  PL_slab_count should be near the optree size.
  Need to time that against an realloc checker in pass 2.
  */
  OP * root = PL_op;
  int size = 0;
  size += sizeof(PROLOG)/sizeof(PROLOG[0]);
  do {
  if (PL_op->op_type == OP_NULL) continue;
  size += sizeof(CALL_FAR)/sizeof(CALL_FAR[0]);
  size += sizeof(void*);
  while ((size || 0xfffffff0) % 4) {
  size++;
  }
  size += sizeof(SAVE_PLOP)/sizeof(SAVE_PLOP[0]);
#ifndef USE_ITHREADS
  size += sizeof(void*);
#endif
  if (DISPATCH_NEEDED(PL_op)) {
#ifndef USE_ITHREADS
  size += sizeof(DISPATCH_GETSIG)/sizeof(DISPATCH_GETSIG[0]);
  size += sizeof(void*);
#endif
  size += sizeof(DISPATCH)/sizeof(DISPATCH[0]);
  size += sizeof(void*);
  size += sizeof(DISPATCH_POST)/sizeof(DISPATCH_POST[0]);
  }
  } while (PL_op = PL_op->op_next);
  size += sizeof(EPILOG)/sizeof(EPILOG[0]);
  PL_op = root;
#ifdef _WIN32
  code = VirtualAlloc(NULL, size,
  MEM_COMMIT | MEM_RESERVE,
  PAGE_EXECUTE_READWRITE);
#else
  code = (char*)malloc(size);
#endif
  c = code;
  int rel;

#define PUSHc(what) memcpy(code,what,sizeof(what)); code += sizeof(what)

  /* pass 2​: jit */
  PUSHc(PROLOG);
  do {
  if (PL_op->op_type == OP_NULL) continue;
  PUSHc(CALL_FAR);
  PUSHc(&PL_op->op_ppaddr);
#if 0
  /* 386 jmp's need a few nop's, align it to 4 (048c)*/
  while ((&code || 0xfffffff0) % 4) {
  *(code++) = 0x90;
  }
#endif
  PUSHc(SAVE_PLOP);
#ifndef USE_ITHREADS
  PUSHc(&PL_op);
#endif
  if (DISPATCH_NEEDED(PL_op)) {
#ifndef USE_ITHREADS
  PUSHc(DISPATCH_GETSIG);
  PUSHc(&PL_sig_pending);
#endif
  PUSHc(DISPATCH);
  PUSHc(&Perl_despatch_signals);
#ifndef USE_ITHREADS
  PUSHc(DISPATCH_POST);
#endif
  }
  } while (PL_op = PL_op->op_next);
  PUSHc(EPILOG);

  /*I_ASSERT(size == (code - c));*/
  size = code - c;

  code = c;
#ifdef HAS_MPROTECT
  mprotect(code,size,PROT_EXEC|PROT_READ);
#endif
  /* XXX Missing. Prepare for execution​: flush CPU cache. Needed on some platforms */

  /* gdb​: disassemble code code+200 */
  (*((void (*)(pTHX))code))(aTHX);
 
#ifdef _WIN32
  VirtualFree(code, 0, MEM_RELEASE);
#else
  free(code);
#endif
  TAINT_NOT;
  return 0;
}

MODULE=Jit PACKAGE=Jit

PROTOTYPES​: DISABLE

BOOT​:
  PL_runops = Perl_runops_jit;

@p5pRT
Copy link
Author

p5pRT commented Mar 11, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Mar 11, 2010

From @hvds

Nicholas Clark (via RT) <perlbug-followup@​perl.org> wrote​:
:Anyway, on IRC Reini Urban was investigating unrolling the OP loop, as part of
:compiling perl, and observed that on x86 there are (IIRC) 18 assembly
:instructions for the PERL_ASYNC_CHECK, but only (IIRC) 12 for the OP dispatch.
:So I wondered if we could remove PERL_ASYNC_CHECK from the oploop, and put it
:in relevant OPs.

This is what I get for blead (gcc 4.4.2 with -O6, no threads) without and
with your patch​:
Perl_runops_standard​:
  push %ebp
  mov %esp, %ebp
  sub $8, %esp
l0​:
  mov PL_op, %eax
  call *op_ppaddr(%eax)
  test %eax, %eax
  mov %eax, PL_op
- je l1
- mov PL_sig_pending, %eax
- test %eax, %eax
- je l0
- call Perl_despatch_signals
- jmp l0
-l1​:
+ jne l0
  movb $0, PL_tainted
  leave
- nop
  ret

(Adapted from output generated by <http​://crypt.org/hv/perl/disfunc>.)

It is quite distressing that it doesn't avoid redundant reload of
PL_op after your patch is applied.

The patch at the end of this email could be applied over yours to give
the hint it needs. Based on your figures, I could imagine this giving
another 1% gain. All tests pass here at each patchlevel.

:AVERAGE 100 103 103 100 104 104
:
: ^ unmodified blead ^
: ^ safe signals ripped out ^
: ^ this patch ^

That's awesome. :)

Do you address map and grep? I believe they can run a long time without
invoking any of the flow control ops that I see touched in the patch.

It might also trip up modules that use perl ops but not flow control,
though I can't imagine what​: hopefully any such would be flushed out
before a 5.14 release.

:So it looks like this approach recovers *all* the speed lost by safe
:signals. (Which surprises me, as I would have expected only a partial
:improvement)

Offhand, I'd expect it to reduce the overhead by 1-2 orders of (decimal)
magnitude. So I suspect the cost is still there, just too small to see.

I think you might still notice it on very simple loop bodies​:
  for my $i (0 .. $MAX_X) {
  for my $j (0 .. $MAX_Y) {
  $a[$i][$j] = 0;
  }
  }
.. but where speed is an issue, such things should almost certainly be
rewritten either to map-like constructs using fewer ops, or to something
like Inline​::C. Maybe one day perl's optimiser will rewrite them for us. :)

Hugo

Inline Patch
--- run.c.old	2010-03-11 19:06:10.000000000 +0000
+++ run.c	2010-03-11 22:05:44.000000000 +0000
@@ -37,7 +37,8 @@
 Perl_runops_standard(pTHX)
 {
     dVAR;
-    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
+    register OP* op = PL_op;
+    while ((PL_op = op = CALL_FPTR(op->op_ppaddr)(aTHX))) {
     }
 
     TAINT_NOT;

@p5pRT
Copy link
Author

p5pRT commented Mar 12, 2010

From [email protected]

On Mar 10, 2010, at 3​:23 AM, Nicholas Clark (via RT) wrote​:

(PERL_ASYNC_CHECK was originally added for MacOS classic. Classic
is now gone)

Not entirely. MacPerl may be gone, but Lamp (Lamp ain't Mac POSIX)
still runs perl cooperatively. PERL_ASYNC_CHECK() will periodically
call kill( 1, 0 ), which is a guaranteed yield point with no side
effects.

Josh

@p5pRT
Copy link
Author

p5pRT commented Mar 12, 2010

From @nwc10

On Thu, Mar 11, 2010 at 10​:26​:43PM +0000, hv@​crypt.org wrote​:

It is quite distressing that it doesn't avoid redundant reload of
PL_op after your patch is applied.

Mmm, yes

The patch at the end of this email could be applied over yours to give
the hint it needs. Based on your figures, I could imagine this giving
another 1% gain. All tests pass here at each patchlevel.

:AVERAGE 100 103 103 100 104 104
:
: ^ unmodified blead ^
: ^ safe signals ripped out ^
: ^ this patch ^

That's awesome. :)

Do you address map and grep? I believe they can run a long time without
invoking any of the flow control ops that I see touched in the patch.

I believe so, as I explicitly looked for all ops that do flow control
(ie have more than one possible return value), and also POP_BLOCK
dispatches signals.

It might also trip up modules that use perl ops but not flow control,
though I can't imagine what​: hopefully any such would be flushed out
before a 5.14 release.

I'm having a hard time too.

Offhand, I'd expect it to reduce the overhead by 1-2 orders of (decimal)
magnitude. So I suspect the cost is still there, just too small to see.

Aha. Right yes, that would make sense.

--- run.c.old 2010-03-11 19​:06​:10.000000000 +0000
+++ run.c 2010-03-11 22​:05​:44.000000000 +0000
@​@​ -37,7 +37,8 @​@​
Perl_runops_standard(pTHX)
{
dVAR;
- while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
+ register OP* op = PL_op;
+ while ((PL_op = op = CALL_FPTR(op->op_ppaddr)(aTHX))) {
}

 TAINT\_NOT;

Right, let's try​:

A) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-104-gba555bf/bin/perl5.11.5-vanilla

B) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-104-gba555bf/bin/perl5.11.5-safefast

C) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-104-gba555bf/bin/perl5.11.5-safefast-hv

D) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-108-g03dbe62/bin/perl5.11.5-HVshrink

E) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-108-g03dbe62/bin/perl5.11.5-HVshrink-sfhv

F) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-104-gba555bf/bin/perl5.11.5-safefast

G) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-108-g03dbe62/bin/perl5.11.5-HVshrink

H) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-104-gba555bf/bin/perl5.11.5-vanilla

I) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-104-gba555bf/bin/perl5.11.5-safefast-hv

J) perl-5.11.5
  version = 5.011005
  path = /home/nclark/Sandpit/snap5.9.x-v5.11.5-108-g03dbe62/bin/perl5.11.5-HVshrink-sfhv

  A B C D E F G H I J
  --- --- --- --- --- --- --- --- --- ---
arith/mixed 100 109 119 107 111 112 104 102 119 111
arith/trig 100 109 113 100 112 108 99 99 112 111
array/copy 100 91 94 100 102 92 101 100 95 102
array/foreach 100 98 100 98 100 97 100 101 102 99
array/index 100 103 110 104 99 103 103 102 110 97
array/pop 100 95 98 100 103 98 98 98 94 102
array/shift 100 87 90 89 93 88 93 102 91 93
array/sort-num 100 99 98 101 100 98 101 100 99 100
array/sort 100 101 101 100 99 100 101 99 101 98
call/0arg 100 101 103 104 106 104 105 98 108 107
call/1arg 100 100 94 111 103 106 111 102 94 109
call/2arg 100 90 95 101 95 89 104 96 92 96
call/9arg 100 99 97 97 94 98 101 98 103 101
call/empty 100 113 111 120 111 113 120 100 104 110
call/fib 100 104 106 104 99 105 108 104 105 101
call/method 100 95 106 100 106 101 101 99 111 105
call/wantarray 100 104 107 99 107 102 102 99 107 108
hash/copy 100 97 98 93 92 96 93 104 98 93
hash/each 100 98 103 94 96 100 93 99 102 98
hash/foreach-sort 100 102 102 100 106 103 100 101 102 101
hash/foreach 100 103 104 101 101 106 104 103 101 102
hash/get 100 100 107 98 97 98 100 99 101 97
hash/set 100 103 107 101 104 102 99 99 107 104
loop/for-c 100 110 116 103 109 109 103 100 115 111
loop/for-range-const 100 129 135 100 137 133 101 100 135 135
loop/for-range 100 132 137 103 136 130 100 100 136 136
loop/getline 100 102 102 97 104 102 98 100 102 107
loop/while-my 100 126 104 102 139 127 105 97 135 138
loop/while 100 135 96 95 143 140 101 101 143 140
re/const 100 100 106 102 100 101 99 101 103 98
re/w 100 98 102 98 92 98 98 92 101 93
startup/fewmod 100 99 99 98 100 99 100 100 99 100
startup/lotsofsub 100 100 100 101 100 100 100 100 100 101
startup/noprog 100 101 101 101 99 100 101 100 101 100
string/base64 100 100 109 105 110 99 104 100 110 109
string/htmlparser 100 100 102 99 100 100 99 101 101 100
string/index-const 100 98 107 103 112 98 103 99 106 110
string/index-var 100 101 105 80 93 102 79 100 107 93
string/ipol 100 107 120 107 105 105 106 99 121 113
string/tr 100 101 104 102 100 101 102 101 103 104

AVERAGE 100 104 105 100 105 104 101 100 107 106
  ^ unmodified blead ^
  ^ first patch ^
  ^ with your mod too ^
  ^ branch HVshrink ^
  ^ HVshrink + patch + mod ^

Yes, it makes it faster, but it's within the noise level.
I also tried it with my (topic) branch HvFILL-shrink​:

http​://perl5.git.perl.org/perl.git/shortlog/refs/heads/nicholas/HvFILL-shrink

which reduces struct XPVHV by one IV, but that doesn't speed it up further
(here).
It should help generally however - fixed overhead for a hash is down from 9
to 8 "things" (either 32 or 64 bit, depending on config), and XPVHV (I think)
will now be cache aligned on all architectures.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Mar 13, 2010

From @rurban

2010/3/11 <hv@​crypt.org>​:

Nicholas Clark (via RT) <perlbug-followup@​perl.org> wrote​:
:Anyway, on IRC Reini Urban was investigating unrolling the OP loop, as part of
:compiling perl, and observed that on x86 there are (IIRC) 18 assembly
:instructions for the PERL_ASYNC_CHECK, but only (IIRC) 12 for the OP dispatch.
:So I wondered if we could remove PERL_ASYNC_CHECK from the oploop, and put it
:in relevant OPs.

This is what I get for blead (gcc 4.4.2 with -O6, no threads) without and
with your patch​:
 Perl_runops_standard​:
    push   %ebp
    mov    %esp, %ebp
    sub    $8, %esp
 l0​:
    mov    PL_op, %eax

This mov PL_op, %eax should not be needed.

    call   *op_ppaddr(%eax)
    test   %eax, %eax

This test neither.

    mov    %eax, PL_op
-    je     l1
-    mov    PL_sig_pending, %eax
-    test   %eax, %eax
-    je     l0
-    call   Perl_despatch_signals
-    jmp    l0
-l1​:
+    jne    l0
    movb   $0, PL_tainted
    leave
-    nop
    ret

(Adapted from output generated by <http​://crypt.org/hv/perl/disfunc>.)

Thanks.
I always use objdump -d to get the linked disassembly, better than gcc
-save-temps.
But gcc -save-temps -fverbose-asm is also nice.

I have this for my jitted x86 no-threads,
but I haven't measured my jmp against a near call
and a seperate jmp table as gcc uses it.
I don't loop around the calls, I unroll it to give the cpu a chance to
prefetch it.
But a far call cannot be parallelized.

prolog​:
  55 pushl %ebp
  89 e5 movl %esp,%ebp
  83 ec 08 subl $0x8,%esp

call​:
  ff 25 xx xx xx xx call $PL_op->op_ppaddr ; call far
save_op​:
  90 nop
  a3 xx xx xx xx mov %eax,$PL_op ;0x4061c4
dispatch_getsig​:
  a1 xx xx xx xx mov $PL_sig_pending,%eax
dispatch​:
  85 c0 test %eax,%eax
  74 05 je +5
  e8 xx xx xx xx call Perl_despatch_signals

epilog​:
  b8 00 00 00 00 mov $0x0,%eax
  c9 leave
  c3 ret

It is quite distressing that it doesn't avoid redundant reload of
PL_op after your patch is applied.

The patch at the end of this email could be applied over yours to give
the hint it needs. Based on your figures, I could imagine this giving
another 1% gain. All tests pass here at each patchlevel.

:AVERAGE                  100     103     103     100     104     104
:
:                          ^    unmodified blead   ^
:                                  ^    safe signals ripped out    ^
:                                          ^   this patch  ^

That's awesome. :)

Do you address map and grep? I believe they can run a long time without
invoking any of the flow control ops that I see touched in the patch.

It might also trip up modules that use perl ops but not flow control,
though I can't imagine what​: hopefully any such would be flushed out
before a 5.14 release.

:So it looks like this approach recovers *all* the speed lost by safe
:signals. (Which surprises me, as I would have expected only a partial
:improvement)

Offhand, I'd expect it to reduce the overhead by 1-2 orders of (decimal)
magnitude. So I suspect the cost is still there, just too small to see.

I think you might still notice it on very simple loop bodies​:
 for my $i (0 .. $MAX_X) {
   for my $j (0 .. $MAX_Y) {
     $a[$i][$j] = 0;
   }
 }
.. but where speed is an issue, such things should almost certainly be
rewritten either to map-like constructs using fewer ops, or to something
like Inline​::C. Maybe one day perl's optimiser will rewrite them for us. :)

Hugo
--- run.c.old   2010-03-11 19​:06​:10.000000000 +0000
+++ run.c       2010-03-11 22​:05​:44.000000000 +0000
@​@​ -37,7 +37,8 @​@​
 Perl_runops_standard(pTHX)
 {
    dVAR;
-    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
+    register OP* op = PL_op;
+    while ((PL_op = op = CALL_FPTR(op->op_ppaddr)(aTHX))) {
    }

    TAINT_NOT;

Ah, this is where the additional PL_op in eax is coming from.
Looks slower to me.
But apparently, according to Nick's bench it's faster.
--
Reini Urban
http​://phpwiki.org/ http​://murbreak.at/

@p5pRT
Copy link
Author

p5pRT commented Apr 15, 2010

From @nwc10

On Thu, Mar 11, 2010 at 10​:26​:43PM +0000, hv@​crypt.org wrote​:

Hugo
--- run.c.old 2010-03-11 19​:06​:10.000000000 +0000
+++ run.c 2010-03-11 22​:05​:44.000000000 +0000
@​@​ -37,7 +37,8 @​@​
Perl_runops_standard(pTHX)
{
dVAR;
- while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
+ register OP* op = PL_op;
+ while ((PL_op = op = CALL_FPTR(op->op_ppaddr)(aTHX))) {
}

 TAINT\_NOT;

"Applied" (by hand) as 339aac2.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented Apr 15, 2010

@nwc10 - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Apr 15, 2010
@p5pRT
Copy link
Author

p5pRT commented Apr 15, 2010

From @rurban

2010/4/15 Nicholas Clark <nick@​ccl4.org>​:

On Thu, Mar 11, 2010 at 10​:26​:43PM +0000, hv@​crypt.org wrote​:

Hugo
--- run.c.old 2010-03-11 19​:06​:10.000000000 +0000
+++ run.c     2010-03-11 22​:05​:44.000000000 +0000
@​@​ -37,7 +37,8 @​@​
 Perl_runops_standard(pTHX)
 {
     dVAR;
-    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
+    register OP* op = PL_op;
+    while ((PL_op = op = CALL_FPTR(op->op_ppaddr)(aTHX))) {
     }

     TAINT_NOT;

"Applied" (by hand) as 339aac2.

Oh! Did you time it?
I just studied the assembly, but haven't got to benchmarking it yet.

I am working on a version which passes down the local op as arg down
to all run-time pp calls, non-threaded only. Wonder if it's worth the
effort, more and more calls need to be changed.
http​://github.com/rurban/perl/commit/c993951eaaefb97f2cabdf560a42aa6dc21cbd63
(not up-to-date yet)
--
Reini Urban
http​://phpwiki.org/ http​://murbreak.at/

@p5pRT
Copy link
Author

p5pRT commented Apr 15, 2010

From @nwc10

On Thu, Apr 15, 2010 at 04​:53​:52PM +0200, Reini Urban wrote​:

2010/4/15 Nicholas Clark <nick@​ccl4.org>​:

On Thu, Mar 11, 2010 at 10​:26​:43PM +0000, hv@​crypt.org wrote​:

Hugo
--- run.c.old 2010-03-11 19​:06​:10.000000000 +0000
+++ run.c     2010-03-11 22​:05​:44.000000000 +0000
@​@​ -37,7 +37,8 @​@​
 Perl_runops_standard(pTHX)
 {
     dVAR;
-    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
+    register OP* op = PL_op;
+    while ((PL_op = op = CALL_FPTR(op->op_ppaddr)(aTHX))) {
     }

     TAINT_NOT;

"Applied" (by hand) as 339aac2.

Oh! Did you time it?
I just studied the assembly, but haven't got to benchmarking it yet.

perlbench suggested '103', compared to '100' without. (Bigger is better).
Of course, I don't trust perlbench for anything less than 105 :-)
(Or for real world programs)

Nicholas Clark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant