
proof of concept/performance test for use float #17831


Open · tonycoz wants to merge 1 commit into base: blead

Conversation

@tonycoz (Contributor) commented Jun 2, 2020

This is an attempt at #17813

I tested performance with a simple mandelbrot set generator (on my old CPU):

tony@mars:.../git/perl2$ time ./perl -Ilib ../mandel.pl

real    0m25.752s
user    0m25.612s
sys     0m0.132s
tony@mars:.../git/perl2$ time ./perl -Ilib -Mfeature=float ../mandel.pl

real    0m19.751s
user    0m19.742s
sys     0m0.004s

@tonycoz added the "do not merge" (Don't merge this PR, at least for now) label Jun 2, 2020
@toddr added the "Feature" (A New Feature.) label Jul 30, 2020
@atoomic added the "needs-work" (The pull request needs changes still) label Jul 30, 2020
@toddr added the "has conflicts" label and removed the "needs-work" label Jul 31, 2020
@tonycoz (Contributor, Author) commented Aug 3, 2020

I see two upvotes - did anyone else try benchmarking this on more useful code?

I ask to see if it's worth developing this further.

I implemented this as a feature, but it doesn't really belong there: it's not a language feature as such, so it shouldn't be enabled by a feature version bundle.

I'm hesitant to use a hints bit since we're fairly short on them.

Simply using an entry in %^H has the same problem that the indirect feature had before features were cached in cop_features - we'd be adding a hash lookup for every binop or unop generated.

Maybe it could be implemented as a feature, but not included in the all feature set, and not documented in feature.pm.
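
A minimal sketch, under loudly-labelled assumptions, of the compile-time check being weighed here: HINT_FLOAT_MATH and float_op_variant() are hypothetical and do not exist in blead. The point is the relative cost - testing a bit in PL_hints (or a feature bit cached in cop_features) is nearly free, while consulting %^H would mean a hash lookup for every binop or unop built.

    /* hypothetical hints-bit check at op-construction time */
    static OP *
    S_maybe_float_binop(pTHX_ I32 type, I32 flags, OP *left, OP *right)
    {
        if (PL_hints & HINT_FLOAT_MATH)        /* hypothetical hints bit */
            type = float_op_variant(type);     /* hypothetical op mapping */
        return newBINOP(type, flags, left, right);
    }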

@richardleach (Contributor)

This still seems worthwhile to me, but none of my useful code really uses float math, so I have nothing handy to benchmark.

@atoomic (Member) commented Aug 5, 2020

Note that we recovered some hint bits with 5d17394, so maybe it's fine to steal one bit for float?

I've not tested/benchmarked this on other code.

@tonycoz (Contributor, Author) commented Sep 15, 2020

> Note that we recovered some hint bits with 5d17394, so maybe it's fine to steal one bit for float?

That recovered only a single bit which is now assigned to the feature mask, where it belongs.

Maybe we just need another hints word.

@jkeenan (Contributor) commented Jan 26, 2021

@tonycoz, @richardleach, @atoomic, can we get an update on the status of this p.r.?

Thank you very much.
Jim Keenan

@tonycoz (Contributor, Author) commented Jan 26, 2021

It's waiting on (likely) adding another hints word.

But I think that needs to wait on reducing the cost of COPs which those are embedded into.

Right now a COP is generated for every statement, but the information in each COP typically doesn't change much except for the line number. I've looked at adding an alternative COP which only has a line number, but this will break some backward compatibility at the XS level.
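
To make the trade-off concrete, a rough sketch (illustrative only, not blead code) of what a line-number-only COP could look like; the XS-level compatibility break is exactly that existing XS code assumes every COP is a full struct cop:

    /* hypothetical slimmed-down statement marker: everything except the
     * line number (file, hints, warnings, features) is taken from the
     * most recent full COP */
    struct smallcop {
        BASEOP              /* op_next, op_ppaddr, op_type, op_flags, ... */
        line_t cop_line;    /* the only per-statement field that usually changes */
    };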

@richardleach (Contributor)

I noticed that the regular versions of these functions do:

      TARGn(left * right, 0);
      SETs( TARG );

rather than:

      SETn( left * right );

to try harder to avoid calling sv_setnv_mg.
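
A minimal sketch, for illustration only, of what a float-only op body could look like using that TARGn()/SETs() pattern; pp_nmultiply is a hypothetical name, not necessarily what the PR adds:

    /* hypothetical float-only multiply, showing the TARGn()/SETs() idiom */
    PP(pp_nmultiply)
    {
        dSP;
        dTARGET;
        NV right = SvNV(POPs);
        NV left  = SvNV(TOPs);
        /* TARGn() writes the NV straight into TARG and only falls back to
         * the slower sv_setnv()/magic path when TARG's flags require it */
        TARGn(left * right, 0);
        SETs(TARG);
        RETURN;
    }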

@demerphq (Collaborator) commented Sep 2, 2022 via email

@tonycoz (Contributor, Author) commented Sep 5, 2022

> On Tue, 26 Jan 2021 at 23:32, Tony Cook @.***> wrote: It's waiting on (likely) adding another hints word. But I think that needs to wait on reducing the cost of COPs which those are embedded into. Right now a COP is generated for every statement, but the information in each COP typically doesn't change much except for the line number. I've looked at adding an alternative COP which only has a line number, but this will break some backward compatibility at the XS level.
>
> I'd like to hear more about this as it aligns with my interest in improving the quality of our error messages. If I can do any legwork here I'd be happy to hear an appraisal of the problem to get started with. Just mail me personally. You know where. :-) Yves

I've stalled on this a bit (error: stack overflow), but I did get a "small COP" largely implemented and I don't remember getting any crashes. I still needed to update caller() to understand the new COPs.

There may have been other problems though; I wasn't comfortable with the way I was detecting whether a small COP was possible, e.g. with code like:

line1;
line2;
if (...) { #line3
   line4;
   line5;
   no strict '...';
   line7;
}
line9;
line10;

lines 1, 4, 7 and 9 needed full COPs, and I hadn't gotten to the point of checking that this was happening when it should.

Even without adding a small COP we could improve memory usage a great deal by reference counting cop_warnings, and I think cop_file on threads; these are profligate users of memory - each COP has its own copy.

@demerphq (Collaborator)

> Even without adding a small COP we could improve memory usage a great deal by reference counting cop_warnings, and I think cop_file on threads; these are profligate users of memory - each COP has its own copy.

In theory it should be pretty easy to use PL_strtab to do that if they are write-once. I will take a look. Do you have a branch for your small cop work?
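
For reference, a minimal sketch of the refcounted-string idea under discussion (roughly the shape of what later landed as RCPV); names and layout here are illustrative, not blead's:

    /* one shared, effectively write-once buffer with a count in front,
     * so every COP stores a pointer instead of carrying its own copy */
    struct shared_pv {
        UV   refcount;
        char pv[1];               /* string data follows the header */
    };

    static char *
    S_shared_pv_new(const char *s, STRLEN len)
    {
        struct shared_pv *p;
        Newxc(p, sizeof(*p) + len, char, struct shared_pv);
        p->refcount = 1;
        Copy(s, p->pv, len, char);
        p->pv[len] = 0;
        return p->pv;             /* callers bump/drop the count via the header */
    }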

@tonycoz (Contributor, Author) commented Oct 26, 2022

> Do you have a branch for your small cop work?

It's very hacky and incomplete (and probably just plain broken), but https://github.com/Perl/perl5/tree/tonyc/less-cop

@demerphq (Collaborator) commented Oct 27, 2022 via email

@iabyn (Contributor) commented Nov 7, 2022 via email

@bbrtj (Contributor) commented Nov 26, 2022

I have compiled perl from your branch and tested it on two pieces of code that use some float calculations.

The first one is Algorithm::QuadTree::PP, which uses some (not much) float math in its circular-shape-finding routine. No improvement was seen.

The second one is more math-heavy, as it tries to find all border coordinates for a line segment. The heart of the function is implemented as follows:

my $coeff_x = ($position2->[1] - $position1->[1]) / ($position2->[0] - $position1->[0]);

my $checks_for_x = sub ($pos_x) {
	state $partial = $position1->[1] - $position1->[0] * $coeff_x;
	my $pos_y = $partial + $pos_x * $coeff_x;
	return ([$pos_x, $pos_y], [$pos_x - 1, $pos_y]);
};

my $checks_for_y = sub ($pos_y) {
	state $partial = $position1->[0] - $position1->[1] / $coeff_x;
	my $pos_x = $partial + $pos_y / $coeff_x;
	return ([$pos_x, $pos_y], [$pos_x, $pos_y - 1]);
};

my @coords = (
	(map { $checks_for_x->($_) } $position1->[0] + 1 .. $position2->[0]),
	(map { $checks_for_y->($_) } $position1->[1] + 1 .. $position2->[1])
);

Those two anonymous coderefs are then run for each integer coordinate of x and y. They are called about 20 times each and the entire function runs 40 thousand times per second, but I see no improvement on the benchmark if the function starts with use feature 'float'; (I expect this feature to work in lexical scope).

I don't think I have anything else at the moment that has more float math in it.

@tonycoz (Contributor, Author) commented Nov 27, 2022

> I don't think I have anything else at the moment that has more float math in it.

I suspect sub call overhead is drowning the math costs.

From memory I used the following to benchmark it:

use strict;
my $max_iter = 100;
++$|;
for my $iy (0 .. 1000) {
  my $y = -1 + 0.002 * $iy;
  for my $ix (0 .. 1000) {
    my $x = -1 + 0.002 * $ix;
    my $i = 0;
    my $xo = $x;
    my $yo = $y;
    my $iter = 0;
    while ($xo * $xo + $yo * $yo <= 10 && ++$iter < $max_iter) {
      ($xo, $yo) = ( $xo * $xo - $yo * $yo + $x, 2 * $xo * $yo + $y);
    }
  }
  print ".";
}
print "\n";

which I probably adapted from a C sample in Imager.

@bbrtj (Contributor) commented Nov 28, 2022

> I suspect sub call overhead is drowning the math costs.

With all math commented out (but variable declarations etc. left in), it runs about 20% faster, so I assume math takes about 16% of its runtime. When benchmarking your code I see a 20-40% improvement, which would mean my code should run about 5-10% faster (taking into account that your code also spends some of its runtime assigning variables etc.). You're right, that might not be enough to show up on a benchmark.

@demerphq (Collaborator) commented Feb 8, 2023

@tonycoz - I implemented the RCPV filename and warnings bits, so we have reduced the size of COPs considerably (all together); maybe we can reconsider making the hints bits bigger now?

Anyway, this PR is old and in conflict. Maybe we should get it rebased so it can be reconsidered?

@tonycoz (Contributor, Author) commented Feb 8, 2023

I'll look at rebasing it, though probably not today.

I'll look at the extra hints word too, though I'm not sure we'll store it for eval (see where doeval_compile() initializes PL_hints).

Since I'm more familiar with features, I've implemented this as a
feature.

At this point there are no new tests.

# Conflicts:
#	ext/Opcode/Opcode.pm
#	feature.h
#	lib/B/Op_private.pm
#	lib/feature.pm
#	opcode.h
#	opnames.h
#	pp_proto.h
#	regen/feature.pl
#	regen/opcode.pl
@EdwardDanchetzNI

It looks like you ran a performance test on a mandelbrot set generator in Perl, comparing the performance of using float versus not using float. The test showed that using float improved the performance by about 6 seconds, with the script running in 19.751 seconds with the float option versus 25.752 seconds without it.

@bulk88 (Contributor) commented Oct 23, 2024

> > Do you have a branch for your small cop work?
>
> It's very hacky and incomplete (and probably just plain broken), but https://github.com/Perl/perl5/tree/tonyc/less-cop

I've seen this done in other code bases differently.

Each OP has a bitfield (0xF or 16) of how many lines it is away from the last COP. Emit a new COP every 16 lines of non-branching code.

Or an array of line numbers (not seq numbers) per sub; each line-number struct has a U32 mask, and each bit in the mask represents an OP struct that is on that line. No runtime penalty for updating line numbers, but it's O(32)-ish to find the line number during an exception. Note this requires evenly sized OPs, and P5 OPs are not equal-sized. The exception-table/JIT-peephole "sub-unit" is 32 or 64 OPs max, obviously.

Or store in the CV an array of U8 chars indexed by the op's position in the CV's OP slab, so PL_curcop's line plus the U8 offset gives the real line number.

Or have a global per-interpreter line-number offset from the last COP; OP_NEXTLINE just increments it. That's smaller than the little-COP design here, since no line number field is needed, and a fat COP can bump the line number by units other than 1.
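
A tiny sketch of that last variant, with hypothetical names (neither OP_NEXTLINE nor the counter exist in blead); a real version would keep the counter per-interpreter and reset it at every full COP:

    static line_t line_delta;     /* per-interpreter in a real implementation */

    PP(pp_nextline)
    {
        line_delta++;             /* current line = last full COP's line + delta */
        return NORMAL;
    }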

@jkeenan added the "Feature Request" and "Feature" (A New Feature.) labels and removed the "Feature" (A New Feature.) and "Feature Request" labels Dec 29, 2024
@richardleach (Contributor)

We could add the new OPs and associated machinery independently of the COP changes. Despite the inability to explicitly use float without the COP work, the peephole optimizer could swap in a new OP for its generic counterpart when an operand is a CONST NV (one that cannot be losslessly converted to an IV). In such cases, we know that a float operation will get carried out.
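
A minimal sketch of that peephole test, under assumptions: the helper name is made up, and the ithreads case (constants moved into the pad, where op_sv is NULL) is ignored. A real check would also have to decide what "cannot be losslessly converted to an IV" means for values like 2.0:

    /* crude test: a CONST operand whose value is NV-only, so any arithmetic
     * involving it will be carried out in floating point */
    static bool
    S_operand_is_pure_nv_const(const OP *o)
    {
        SV *sv;
        if (o->op_type != OP_CONST)
            return FALSE;
        sv = cSVOPx_sv(o);
        return sv && SvNOK(sv) && !SvIOK(sv);
    }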

@bulk88 (Contributor) commented Jun 13, 2025

> @tonycoz - I implemented the RCPV filename and warnings bits, so we have reduced the size of COPs considerably (all together); maybe we can reconsider making the hints bits bigger now?
>
> Anyway, this PR is old and in conflict. Maybe we should get it rebased so it can be reconsidered?

The C-language, C-level HV* front end has always been very ugly. PP OP*s usually are manipulating SV* PV HEK* strings and feeding them to the HV* API, while C99 functions are always feeding hv_fetch(hv, "TheKey", 6, FALSE, (U32_HASH)0); C strings with no precalculated U32 hash number and no char* == char* skip-the-memcmp() libc-call optimization. The 25 SV_CONST(EXISTS) macros are a step in the right direction towards fixing the C-level HV* front end, but 25^2 more SV*s are needed in that array to hold the full Perl 5 BNF grammar that lives inside libperl.so. Just run strings on libperl.so and, after ignoring UTF invlist-related C strings, you will start to see my point: all ISO C "" literal strings need to be converted at "CC time" or "PP compile time" into SV* PV HEK*s exactly once per perl process lifetime.

@bulk88 (Contributor) commented Jun 13, 2025

> We could add the new OPs and associated machinery independently of the COP changes. Despite the inability to explicitly use float without the COP work, the peephole optimizer could swap in a new OP for its generic counterpart when an operand is a CONST NV (one that cannot be losslessly converted to an IV). In such cases, we know that a float operation will get carried out.

Correct. Regardless of what is said on PerlMonks, Stack Overflow, HN and Reddit, P5 has machine types; the only limitation is that PP devs can't turn off the operator overloading and can't disable the implicit <dynamic_cast> methods. JavaScript/ECMAScript has the exact same design defect as Perl 5 - POD doesn't exist and every identifier is full-blown OOP - yet what backwards, unrefined, dangerous programming language written by one guy over a weekend are you reading my comment with?

cough cough http://fglock.github.io/Perlito/
cough cough https://webperl.zero-g.net/democode/index.html

https://web.dev/articles/performance-mystery
http://wingolog.org/archives/2011/07/05/v8-a-tale-of-two-compilers

Now let's look at some C code:

https://github.com/v8/v8/blob/master/src/interpreter/bytecodes.h#L647

  // Return true if |bytecode| is an accumulator load without effects,
  // e.g. LdaConstant, LdaTrue, Ldar.
  static constexpr bool IsAccumulatorLoadWithoutEffects(Bytecode bytecode) {
    STATIC_ASSERT(Bytecode::kLdar < Bytecode::kLdaImmutableCurrentContextSlot);
    return bytecode >= Bytecode::kLdar &&
           bytecode <= Bytecode::kLdaImmutableCurrentContextSlot;
  }

OMG, did I just see the Chrome browser write

if (SvREADONLY(sv) && !SvMAGICAL(sv) && !SvROK(sv))  {
   ck_something_fold(op);
    op_free_something(op->op_next);
}

??? !!!

But Perl since 2014 is doing escape analysis IN THE RUNLOOP INSIDE PP_ENTERSUB!!!!

https://github.com/Perl/perl5/blob/blead/pp_hot.c#L6404

        {
            SV **svp = MARK;
            while (svp < PL_stack_sp) {
                SV *sv = *++svp;
                if (!sv)
                    continue;
                if (SvPADTMP(sv)) {
                    SV *newsv = sv_mortalcopy(sv);
                    *svp = newsv;
#ifdef PERL_RC_STACK
                    /* should just skip the mortalisation instead */
                    SvREFCNT_inc_simple_void_NN(newsv);
                    SvREFCNT_dec_NN(sv);
#endif
                    sv = newsv;
                }
                SvTEMP_off(sv);
            }
        }

perl5/pp_hot.c, line 6451 in d6f09a8:

            items = PL_stack_sp - MARK;
            if (UNLIKELY(items - 1 > AvMAX(av))) {
                SV **ary = AvALLOC(av);
                Renew(ary, items, SV*);
                AvMAX(av) = items - 1;
                AvALLOC(av) = ary;
                AvARRAY(av) = ary;
            }

            if (items)
                Copy(MARK+1,AvARRAY(av),items,SV*);
            AvFILLp(av) = items - 1;
#ifdef PERL_RC_STACK
            /* transfer ownership of the arguments' refcounts to av */
            PL_stack_sp = MARK;
#endif
        }

I'm sure there is a valid technical rationale for the above, but I really despise seeing my yellow arrow enter this block when I'm holding down F11.

https://github.com/Perl/perl5/blob/d6f09a896842e5288af5d3817756b67a919ad7ad/pp_hot.c#L6525C1-L6541C10

        else {
            SV **mark = PL_stack_base + markix;
            SSize_t items = PL_stack_sp - mark;
            while (items--) {
                mark++;
                if (*mark && SvPADTMP(*mark)) {
                    SV *oldsv = *mark;
                    SV *newsv = sv_mortalcopy(oldsv);
                    *mark = newsv;
#ifdef PERL_RC_STACK
                    /* should just skip the mortalisation instead */
                    SvREFCNT_inc_simple_void_NN(newsv);
                    SvREFCNT_dec_NN(oldsv);
#endif
                }
            }
        }

Like, for real, is @_ an AJAX socket? I thought Perl subs live in the same virtual address space for speed reasons, but I guess I was wrong: each PP sub or XSUB executes on a different Android smartphone, over an SDN VPN with HTTP/2 JSON packets over LTE.

@bulk88 (Contributor) commented Jun 13, 2025

So, reading paragraph #2 needs one minute of engineering classroom time.

In Perl-ese, I, bulk88, will say that what Google calls a JSObject, Perl calls an SvROK() or an SV* head struct. What Google calls SMIs, Perl calls SVt_IV bodyless SV*s, or perhaps Perl's equivalent of Google's SMIs is PL_sv_immortals[]/&PL_sv_yes.

Here is Google's PL_ppaddr[] array; it's easy to read:

https://github.com/v8/v8/blob/master/src/compiler/opcodes.h

The pp_foo()s with names that aggressively flaunt raw CPU machine types instead of JSObjects or SMIs do not, in my professional opinion, belong inside the P5P-distributed libperl.so VM. C89/99/C++23's liveness, escape-analysis and strict-aliasing rules, along with the inability to do runtime, real-live-end-user continuous instrumenting - where each extra 100 entry hits on a Perl sub/CV* obj triggers another pass of progressive JITting and more and more big-O, big-wall-time op_ck_foo() rounds on the PP AST - mean I think the Larry Wall day-1 Perl 5 C-level mix of high-level/low-level-ness in the pp_foo() funcs is fine. We don't want to go down to SSA and RTL and FPGA transistor wiring with a 66/110 punch tool. But we don't want each pp_foo() func or PP opcode to be a Turing-complete CPU emulator/LLM neural network framework either.

It's also inappropriate - and the ship sailed very long ago - to turn Larry's Perl 5 engine into a 9-15 MB .css or .xml or GNUmakefile file that is loaded by LLVM/V8/Dalvik/Apple Swift interpreters/compilers. Raku failed and NQP is the leftovers; Parrot, Reini Urban's project (search GH for perl11 or his GH for its name; I compiled it exactly once for Win32 but it insta-SEGVs) and RPerl have all converted Larry's C89 Perl 5 code base into a single ".css file" that is fed into a non-C compiler, with the final binary executing Perl 5 code with about 75-85% source-code and bug compatibility with P5P blead's distributed .pm and .pl files.

Other than RPerl, I don't think any Perl 5 grammar turned into a ".css" executing on a foreign JIT interp runtime has ever out-benchmarked the stock C89 P5P code base. RPerl is a clone of https://en.wikipedia.org/wiki/Asm.js but for the Perl 5 language; libperl.so is always inside the address space of an RPerl process, and all RPerl PP subs can be step-debugged by perl5db.pl. RPerl's main method of operation is: certain AST-ed/OP-treed CV*s, if they match an extremely strict "regexp", get emitted as super-clean (perfect, IMO) C89 source code, which gets sent through FooOS's vanilla CC toolchain and eventually becomes XSUBs. It's done at runtime/BEGIN{} time with no extra work by the end user.

This is my guess - I've never asked RPerl's author and he has never told me himself - but from single-stepping the FOSS RPerl code and troubleshooting it with him, I have a hunch he either whiteboarded or added provisions for, or that he or his employer would sell, a commercial version of RPerl that automatically takes PP P5 grammar subroutines, compiles them, and uploads them to an Nvidia RTX GPGPU card, from inside the P5P Perl_runops_standard() runloop.

This is pointless for 80% of Perl users and their production code, but RPerl is a death sentence for CPAN's PDL module, since RPerl allows AI/crypto-coin-mining/big-data/big-sci computing code to be written by entry-level PP devs in Perl 5 after reading the Camel book, instead of learning a foreign programming language called PDL and FFI-ing into the PDL abstract virtual machine.

If I had a personal hobby or business reason to do it, I would just write 0x500 bytes of machine code plus .rdata x64/i386 JIT assembler, throw it into leonerd's builtin.c, and then the PSC could release Perl 7 with killer new features like JIT from PP.

*Disclaimer: P5P does not accept any bug reports for SEGVs caused by broken JIT x64 created with the *main::builtin::jit:: package; if the bug reporter can't write it in GCC inline ASM without a SEGV, they can't write high-performance JIT in Perl 5 PP code either.

Labels: do not merge (Don't merge this PR, at least for now), Feature (A New Feature.), hasConflicts, Stalled
10 participants