Consider FOR_ITER family for specialization? #67

Fidget-Spinner · 2021-07-13T13:43:53Z

I haven't properly profiled anything. Throwing this idea here to see what people think.

Consider this fairly common pattern in Python:

for i in range(len(x)):
    ...

Bytecode snippet:

           6 GET_ITER
     >>    8 FOR_ITER                 2 (to 14)

Anecdotally, some users are surprised at how much overhead this has. For most simple for loops, users intend for the range objects to be used as the equivalent for(int i=0; i < x; i++) loop in C. The range objects are created then thrownaway immediately.

In those cases, calling next(range_iterator) in FOR_ITER is unnecessary overhead. We can unbox this into a simple PyLong object, then PyLong_Add on it. This can be a FOR_ITER_RANGE opcode.

This will have to be extremely conservative. We can only implement this on range objects with a reference count of 1 (ie used only for the for loop) and they must be the builtin range and not some monkeypatched version. Without the compiler’s help, the following would be very dangerous to optimize:

x = iter(range(10))
for _ in x:
    next(x)

We can also do the same with lists and tuples (FOR_ITER_LIST and FOR_ITER_TUPLE), but use the the native PyTuple/List_GetItem instead of iterator protocol. But I'm not sure how common something like for x in (1,2,3) is. So maybe those aren't worth it (they're also much harder to rollback if the optimization breaks halfway).

FOR_ITER isn't a common instruction. But I think when it's used it matters, because it tends to be rather loopy code.

What do y'all think?

The text was updated successfully, but these errors were encountered:

gvanrossum · 2021-07-15T04:27:18Z

I definitely don't think that for x in (1, 2, 3) or for x in [1, 2, 3] are common enough to bother. I have been eyeing for i in range() myself, but before the specialization framework was implemented it seemed tricky.

There may have been some previous hacks, see comments by user @heeres in #42. IIRC the idea was to reuse the integer object rather than allocating a new one. However, making sure that there are no other users of that same object is tricky.

Of course, if we ever get tagged integers, that would be the right approach.

Also, in case you hadn't looked yet, range() already has two entirely distinct implementations depending on whether the end points fit in a machine word. But of course this just saves doing the increment using integer objects, the result still has to be allocated. (A free list might help too?)

Fidget-Spinner · 2021-07-15T14:26:39Z

Oops I didn't see that issue. It's cool that the solutions and problems somewhat converge with this issue. Also those are some really nice hacks by Heeres!

You mentioned trying out FOR_ITER/{STORE_NAME|STORE_FAST} combined instruction. This makes me wonder if the super instruction would be faster than individually specializing FOR_ITER and STORE. Or maybe the fastest would be specializing that combined instruction ;)?

gvanrossum · 2021-07-15T15:29:12Z

We’ll, there’s not much to specialize about STORE_FAST, and STORE_NAME isn’t worth it (a for loop at the module level?!) So the super-instruction seems a good idea; specializing it for range() seems good too.

You might try to measure how often ‘for i in range(…)’ actually occurs. I think in my own code I use enumerate() more than range(), but that is harder to specialize (or is it?). It would be a nice win to avoid needing the tuple there (this would need help from the compiler).

markshannon · 2021-07-16T10:50:45Z

One possible way to speed up tight for loops is to add a FOR_END opcode, which would be like FOR_ITER but reverse the direction of branch. It would not branch when the iterator is exhausted.
So for x: body would compile to:

    FOR_ITER end
loop:
    body 
    FOR_END loop
end:

instead of

loop:
    FOR_ITER end
    body 
    JUMP_ABSOLUTE loop
end:

I don't know if this would be profitable, but might be worth an experiment.

Fidget-Spinner · 2021-07-20T14:15:13Z

After some hacking around, I think my original approach is pretty naive -- next(iter) seems to be very little overhead (just calling tp_iternext which increments and creates a PyLong).

User heeres is right that most of the overhead is from the memory allocation. I'm surprised though -- I expected with how pymalloc works, that _PyObject_Malloc would be very cheap.

I will attempt to use the specializing interpreter features to apply their optimizations in a safer manner (with full credits given to them of course). And hopefully, it will have better results.

Fidget-Spinner · 2021-07-21T12:32:54Z

I'm abandoning this idea. I integrated the PyLong reuse hacks above into a specialized instruction, and it produces a significant pessimization on ranges with small longs [1].

Since there's a small int cache in CPython, any long from -5 to 256 is essentially free. So the overhead of an adaptive instruction is significant. I would expect such small ranges to cover most use cases.

Additionally, to get the full test suite to pass, I had to be extremely conservative. This means more than half the time it still had to allocate a new long. I don't see this going too far without the compiler's help.

There is some speedup on very large ranges, but it was only 10% on a microbench [2]. I'm not sure what settings user heeres was testing with (eg. debug, release, PGO?) for 50%, and I might have implemented things wrongly, so I can't provide an explanation. One guess is that pymalloc is quite efficient after PGO. Also if you're dealing with such large ranges, you're probably better off using numpy, cython, numba etc. Overall, even if we can fix the other shortcomings, for the complexity it introduces it's not profitable.

TLDR: Combined instructions or Mark's rearranged instructions look like a better fit. I would think FOR_ITER + ~~LOAD_FAST~~ (woops, STORE_FAST) would show benefits. But they cross basic blocks, so optimization is more difficult. Maybe interpreter specializations can be re-looked at again for late-stage optimizations.

Note: All results were obtained on Windows with PGO. Also I know I should use pyperf instead, but I was short on time 😉 . Sorry.

[1]

python.exe -m timeit "for _ in range(0, 256): pass"
Before:
100000 loops, best of 5: 2.17 usec per loop
Specialized:
100000 loops, best of 5: 2.45 usec per loop

[2]

python.exe -m timeit "for _ in range(256, 10000): pass"
Before:
2000 loops, best of 5: 138 usec per loop
Specialized:
2000 loops, best of 5: 126 usec per loop

Commit: Fidget-Spinner/cpython@e5393d6

gvanrossum · 2021-07-21T15:46:19Z

Thanks for doing all the research!

If you want to combine FOR_ITER and LOAD_FAST in the compiler, that could be a super-instruction (#16) added in the assembly phase. I have done a few of these, but never merged any upstream (https://github.com/faster-cpython/cpython/tree/super-instr). Also check out Mark's recent attempt (https://github.com/faster-cpython/cpython/tree/super-instructions).

Fidget-Spinner · 2021-07-25T07:30:31Z

I'm on a roll producing things with 0 speedups 😉 . FOR_ITER + STORE_FAST showed no speedups too. Seems like the existing PREDICT macro is doing a really good job. On Windows with PGO:

python.exe -m timeit -s
>> "def f(n):
>>  for _ in range(n):
>>   pass
>> " "f(10_000)"

Before:
2000 loops, best of 5: 138 usec per loop
2000 loops, best of 5: 141 usec per loop
2000 loops, best of 5: 138 usec per loop

Superinstruction:
2000 loops, best of 5: 140 usec per loop
2000 loops, best of 5: 137 usec per loop
2000 loops, best of 5: 137 usec per loop

Commit: Fidget-Spinner/cpython@c8a1338

Fidget-Spinner · 2021-09-18T03:38:42Z

To close the loop:

It seems that Larry tried something similar (though with a different approach) back in 2015: https://bugs.python.org/issue24138. He decided it wasn't worth it in the end.

That spawned some attempts at a long freelist in https://bugs.python.org/issue24165. But the pyperformance numbers were a little murky, so that was abandoned too.

gvanrossum · 2021-09-18T17:56:35Z

Thanks for the links! I wonder if maybe we should look at that integer freelist again.

Fidget-Spinner · 2021-09-19T03:00:59Z

Once other operations are faster, the overhead of the integer allocation will be a bigger percentage. So maybe it will be noticeable in 3.11 :).

Oh, and one more really cool idea by Serhiy, which I think will probably get merged https://bugs.python.org/issue45026. In short, the PyLong to return is currently calculated by (start + index * step) (index is the index of the iterator, not the return value). The change stores the previous Pylong in start, so the new PyLong uses just (start + step). Iterating ranges are 2x faster in his PR!

markshannon · 2021-11-08T12:00:17Z

I'm reopening this. Specialization of FOR_ITER is worthwhile, if only for generators and coroutines. A specialization of FOR_ITER for generators can avoid consuming C stack in the same way as we currently do for calls to Python functions.
If we are going to specialize for generators anyway, additional specializations may be worthwhile.

A specialization for list iterators might be worthwhile.
Handling range is probably out of scope for PEP 659 specialization, as we want to perform scalar replacement of the iterator which is going to be a job for a higher tier optimizer.

markshannon · 2022-01-04T14:39:38Z

Specialization of generators

Currently FOR_ITER for a generator has to go through the following C call stack:

gen_iternext which can't be inlined as it is called through a function pointer
gen_send_ex2 which probably isn't inlined as it is too big.
_PyEval_EvalFrameDefault which obviously isn't going to be inlined.

This is a lot of overhead for what is just a link-and-branch.

A specialization of FOR_ITER for a generator would need to do the following:

Push None to the generator's stack.
Push the generator's frame and resume.

Which should be much faster than the current rigmarole.

The only tricky part is returning from the generator which will need to be handled differently from a normal return.
The obvious way to do this is to add a new opcode GEN_RETURN which would pop the frame, jump to the FOR_ITER target and resume.

markshannon · 2022-01-04T14:41:52Z

If we are going to specialize for generators, then we will have to pay the price of PEP 659, so we should also specialize for range, list, and any other common cases.

markshannon · 2022-02-02T15:20:15Z

Type classification stats for FOR_ITER running the standard benchmarks:

Class	Count	Percentage
other	111048830	7.4%
generator	30044759	2.0%
list	555532957	37.0%
tuple	59327730	4.0%
set	66675780	4.4%
str	957966	0.1%
bytes	646850	0.0%
range	516693401	34.4%
itertools (all types)	7129894	0.5%
dict (keys)	1024633	0.1%
dict (items)	61149814	4.1%
dict (values)	225081	0.0%
enumerate	90033090	6.0%

FOR_ITER represents 2% of the total bytecodes executed.

There are no hits for coroutines and async generators.

markshannon · 2022-02-02T15:21:16Z

Stats gathered with python/cpython#31079

sweeneyde · 2022-02-20T19:38:21Z

    FOR_ITER end
loop:
    body 
    FOR_END loop
end:

instead of

loop:
    FOR_ITER end
    body 
    JUMP_ABSOLUTE loop
end:

I tried out a prototype of this here: python/cpython@main...sweeneyde:for_end

Pyperformance: https://gist.github.com/sweeneyde/17e6e2c426d6a77034e551534a82a3d4
(Geometric mean: 1.01x faster)

Microbenchmark:
./python -m pyperf timeit -s "A = [0.0] * 1000" "for a in A: +a" -o "pr_micro.json"
Mean +- std dev: [main_micro] 11.6 us +- 0.0 us -> [pr_micro] 10.6 us +- 0.2 us: 1.09x faster

Maybe architecturally it really ought to be a peephole optimization like below but there is something wrong and it wasn't working:

case (JUMP_ABSOLUTE targeted at FOR_ITER):
    assert(i == bb->b_iused - 1);
    inst->i_opcode = FOR_END;
    inst->i_target = inst->i_target->b_next;
    bb->b_next = target->i_target;
    bb->b_nofallthrough = 0;
    COPY_INSTR_LOC(*inst, *target);
    --i;
    break;

brandtbucher · 2022-03-01T01:14:28Z

bb->b_nofallthrough = 0;

This should become 1, right? Maybe that's why it didn't work?

sweeneyde · 2022-03-01T01:39:09Z

bb->b_nofallthrough = 0;

This should become 1, right? Maybe that's why it didn't work?

FOR_END does fall through in one of the two branches, so nofallthrough should become false I believe (an unfortunate double-negative). Since the instruction was a JUMP_ABSOLUTE, nofallthrough should be true already I think? Without that line, some assertion about nofallthrough failed IIRC.

Is it somehow problematic to take a basicblock that had no fallthrough and then add a fallthrough to it?

brandtbucher · 2022-03-01T02:06:36Z

FOR_END does fall through in one of the two branches, so nofallthrough should become false I believe (an unfortunate double-negative).

Ha, yep. Time for me to go home.

It may be the bb->b_next = target->i_target; move, then. You might have gotten yourself into a situation where two different blocks fall through to the same target or something:

if condition:
    # This FOR_ITER will be peepholed first, and the jump will be threaded 
    # straight to baz()...
    for i in range(42):
        foo()
        # ...then, I *think* your new peephole optimization will stick a
        # FOR_END here that tries to fall through to baz()...
else:
    bar()
    # ...but we already fall through to it here!
baz()

If so, then adding an explicit JUMP_FORWARD instead might fix it.

sbrunthaler · 2022-03-15T20:22:05Z

Hi everyone,

my Python implementation did some specialization for the FOR_ITER instructions, which are documented in the PhD thesis and also in the ECOOP paper. So maybe it's worth to take a look and do some simple measurements whether this still makes sense on much more modern hardware.

For generator optimization, we had some nifty idea that we prototyped in our ZipPy VM, which resulted in massive speedups. Some of the details may not directly apply to CPython, since this was done in Truffle/Graal, but there were some interpreter benefits as well AFAIR. This optimization was accepted at OOPSLA'14, and if there is some interest, I can look it up and provide the necessary details for discussion.

All the best,
--stefan

sweeneyde · 2022-03-20T07:48:44Z

My current attempt at replacing JUMP_ABSOLUTE/FOR_ITER with FOR_END at python/cpython@main...sweeneyde:for_end has a flaw in how it makes some tracing have to happen. For example, I have a failing test case test_05_no_pop_tops (test.test_sys_settrace.TraceTestCase), which has the following

def no_pop_tops():      # 0
    x = 1               # 1
    for a in range(2):  # 2
        if a:           # 3
            x = 1       # 4
        else:           # 5
            x = 1       # 6

AssertionError: events did not match expectation:
  (0, 'call')
  (1, 'line')
  (2, 'line')
+ (2, 'line')
  (3, 'line')
  (6, 'line')
+ (2, 'line')
  (2, 'line')
  (3, 'line')
  (4, 'line')
  (2, 'line')
  (2, 'return')

The dis output, with the two differing lines labelled:

>>> from test.test_sys_settrace import no_pop_tops
>>> from dis import dis
>>> dis(no_pop_tops)
116           0 RESUME                   0

117           2 LOAD_CONST               1 (1)
              4 STORE_FAST               0 (x)

118           6 LOAD_GLOBAL              1 (NULL + range)
             18 LOAD_CONST               2 (2)
             20 PRECALL                  1
             24 CALL                     1
             34 GET_ITER
             36 JUMP_FORWARD             8 (to 54)    # <=========== THIS IS NEW
        >>   38 STORE_FAST               1 (a)

119          40 LOAD_FAST                1 (a)
             42 POP_JUMP_IF_FALSE       25 (to 50)

120          44 LOAD_CONST               1 (1)
             46 STORE_FAST               0 (x)
             48 JUMP_FORWARD             2 (to 54)

122     >>   50 LOAD_CONST               1 (1)
             52 STORE_FAST               0 (x)

118     >>   54 FOR_END                 19 (to 38)    # <=========== THIS IS NEW
             56 LOAD_CONST               0 (None)
             58 RETURN_VALUE

So either there's some fix to the tracing logic that I'm missing (I'm not super familiar), or FOR_ITER needs to stick around and duplicate most of the FOR_END logic, which isn't ideal IMO. Any suggestions?

brandtbucher · 2022-03-23T18:41:27Z

Perhaps use ADDOP_JUMP_NOLINE for FOR_END? You might need to manually insert a NOP with the correct line number just before it (in the previous block, at the end of the loop), too. Not sure.

This is where the magic happens, so you'll want to start there if you're going to be debugging weird traces. (Looking at it now, maybe that i->i_lineno should really be 0 < i->i_lineno? It seems wrong in its current form.)

iritkatriel · 2022-03-24T12:23:00Z

I broke the trace in a similar way with a JUMP_BACK opcode. It turned out that I needed to patch this function in frameobject.c to add the new jump type to the switch. Maybe your problem is there as well.

sweeneyde · 2022-04-12T23:31:17Z

Even if range() microbenchmarks are hard to budge, the code path for list iterators is a lot smaller, so there's probably a decent speedup possible from specializing there. Dict items and maybe enumerate (the next most common couple types in pyperformance) could be made to not have to construct a tuple at all, unpacking directly onto the stack. And then generators could speed up as well, if desired.

A FOR_ITER-->FOR_END change seems to make only microscopic improvements at best, but it seems like it could be good to make a decision about whether that's appropriate to do before trying to specialize. I opened a FOR_END issue at python/cpython#91432 and a PR at python/cpython#70016 that got the tracing bugs ironed out.

On the other hand, one wild idea would be to quicken JUMP_BACKWARD to FOR_END_ADAPTIVE when its target is a FOR_ITER. That way, we get the same old readable unquickend bytecode, but then get the benefits of one less opcode per loop. It would make JUMP_BACKWARD take some cache entries, but maybe that's okay, or maybe we could even make it only take up cache entries when it's in that situation.

brandtbucher · 2022-04-13T15:19:35Z

Dict items and maybe enumerate (the next most common couple types in pyperformance) could be made to not have to construct a tuple at all, unpacking directly onto the stack.

Just a heads-up: almost all built-in iterators that yield tuples already typically avoid constructing more than one (for exactly this reason), so this may not pay off much.

sweeneyde · 2022-04-20T00:40:32Z

With the recent NOTRACE_DISPATCH_SAME_OPARG() fix, I was able to get a plain-old FOR_ITER specialization working for lists and ranges. Code is here: python/cpython#91713

Microbenchmarks show ~1.1x speedup for lists, ~1.5x speedup for small ranges, but ~2.9x speedup for larger ranges! This comes from modifying the target local in place. It's almost like taking two of @Fidget-Spinner's patches and squishing them together.

Pyperformance numbers coming soon.

mdboom · 2022-08-02T16:14:49Z

@sweeneyde, @markshannon: With python/cpython#91713 merged, is this issue finished?

markshannon · 2022-09-08T16:42:47Z

I think this is done.

We already have specializations for range, list, and tuple. The function pointer tp_next is probably good enough for other classes.
Another approach to this is Virtual iterators #392
Specializing for generators is covered by Specialization of generator iteration and coroutine await #457

Fidget-Spinner changed the title ~~Consider FOR_ITER family of specialization?~~ Consider FOR_ITER family for specialization? Jul 13, 2021

Fidget-Spinner closed this as completed Jul 21, 2021

markshannon mentioned this issue Jul 27, 2021

Specializations #74

Closed

markshannon reopened this Nov 8, 2021

gramster added this to Fancy CPython Board Jan 10, 2022

gramster moved this to Todo in Fancy CPython Board Jan 10, 2022

gramster moved this from Todo to Other in Fancy CPython Board Jan 10, 2022

gramster moved this from Other to Todo in Fancy CPython Board Jan 24, 2022

sweeneyde mentioned this issue Apr 10, 2022

Optimizing FOR_ITER bytecode python/cpython#91432

Closed

sweeneyde mentioned this issue Apr 24, 2022

gh-91432: Specialize FOR_ITER python/cpython#91713

Merged

markshannon removed this from Fancy CPython Board Aug 1, 2022

mdboom added the epic-specialization More specialization work for 3.12 label Aug 2, 2022

markshannon mentioned this issue Aug 16, 2022

Handling the three-way exits of generators and coroutines efficiently. #448

Open

markshannon closed this as completed Sep 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider FOR_ITER family for specialization? #67

Consider FOR_ITER family for specialization? #67

Fidget-Spinner commented Jul 13, 2021 •

edited

Loading

gvanrossum commented Jul 15, 2021

Fidget-Spinner commented Jul 15, 2021 •

edited

Loading

gvanrossum commented Jul 15, 2021

markshannon commented Jul 16, 2021

Fidget-Spinner commented Jul 20, 2021 •

edited

Loading

Fidget-Spinner commented Jul 21, 2021 •

edited

Loading

gvanrossum commented Jul 21, 2021

Fidget-Spinner commented Jul 25, 2021 •

edited

Loading

Fidget-Spinner commented Sep 18, 2021

gvanrossum commented Sep 18, 2021 via email

Fidget-Spinner commented Sep 19, 2021

markshannon commented Nov 8, 2021

markshannon commented Jan 4, 2022

markshannon commented Jan 4, 2022

markshannon commented Feb 2, 2022 •

edited

Loading

markshannon commented Feb 2, 2022

sweeneyde commented Feb 20, 2022 •

edited

Loading

brandtbucher commented Mar 1, 2022

sweeneyde commented Mar 1, 2022

brandtbucher commented Mar 1, 2022 •

edited

Loading

sbrunthaler commented Mar 15, 2022

sweeneyde commented Mar 20, 2022

brandtbucher commented Mar 23, 2022

iritkatriel commented Mar 24, 2022

sweeneyde commented Apr 12, 2022 •

edited by gvanrossum

Loading

brandtbucher commented Apr 13, 2022

sweeneyde commented Apr 20, 2022

mdboom commented Aug 2, 2022

markshannon commented Sep 8, 2022

Consider FOR_ITER family for specialization? #67

Consider FOR_ITER family for specialization? #67

Comments

Fidget-Spinner commented Jul 13, 2021 • edited Loading

gvanrossum commented Jul 15, 2021

Fidget-Spinner commented Jul 15, 2021 • edited Loading

gvanrossum commented Jul 15, 2021

markshannon commented Jul 16, 2021

Fidget-Spinner commented Jul 20, 2021 • edited Loading

Fidget-Spinner commented Jul 21, 2021 • edited Loading

gvanrossum commented Jul 21, 2021

Fidget-Spinner commented Jul 25, 2021 • edited Loading

Fidget-Spinner commented Sep 18, 2021

gvanrossum commented Sep 18, 2021 via email

Fidget-Spinner commented Sep 19, 2021

markshannon commented Nov 8, 2021

markshannon commented Jan 4, 2022

Specialization of generators

markshannon commented Jan 4, 2022

markshannon commented Feb 2, 2022 • edited Loading

markshannon commented Feb 2, 2022

sweeneyde commented Feb 20, 2022 • edited Loading

brandtbucher commented Mar 1, 2022

sweeneyde commented Mar 1, 2022

brandtbucher commented Mar 1, 2022 • edited Loading

sbrunthaler commented Mar 15, 2022

sweeneyde commented Mar 20, 2022

brandtbucher commented Mar 23, 2022

iritkatriel commented Mar 24, 2022

sweeneyde commented Apr 12, 2022 • edited by gvanrossum Loading

brandtbucher commented Apr 13, 2022

sweeneyde commented Apr 20, 2022

mdboom commented Aug 2, 2022

markshannon commented Sep 8, 2022

Fidget-Spinner commented Jul 13, 2021 •

edited

Loading

Fidget-Spinner commented Jul 15, 2021 •

edited

Loading

Fidget-Spinner commented Jul 20, 2021 •

edited

Loading

Fidget-Spinner commented Jul 21, 2021 •

edited

Loading

Fidget-Spinner commented Jul 25, 2021 •

edited

Loading

markshannon commented Feb 2, 2022 •

edited

Loading

sweeneyde commented Feb 20, 2022 •

edited

Loading

brandtbucher commented Mar 1, 2022 •

edited

Loading

sweeneyde commented Apr 12, 2022 •

edited by gvanrossum

Loading