Consider FOR_ITER family for specialization? #67
There may have been some previous hacks; see the comments by @heeres in #42. IIRC the idea was to reuse the integer object rather than allocating a new one. However, making sure that there are no other users of that same object is tricky. Of course, if we ever get tagged integers, that would be the right approach. Also, in case you hadn't looked yet, range() already has two entirely distinct implementations depending on whether the end points fit in a machine word. But of course this just saves doing the increment using integer objects; the result still has to be allocated. (A free list might help too?)
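A small illustration of why the reuse is tricky (my own sketch, not from the thread): the loop variable can escape, so mutating one shared PyLong in place would corrupt every escaped reference.

```python
# If the interpreter reused a single integer object for the loop index,
# this list would end up as [2, 2, 2] instead of [0, 1, 2]:
refs = []
for i in range(3):
    refs.append(i)  # "i" escapes the loop body here
assert refs == [0, 1, 2]
```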
Oops, I didn't see that issue. It's cool that the problems and solutions somewhat converge with this one. Also, those are some really nice hacks by Heeres! You mentioned trying out a FOR_ITER/{STORE_NAME|STORE_FAST} combined instruction. This makes me wonder if the super-instruction would be faster than individually specializing FOR_ITER and STORE. Or maybe the fastest would be specializing that combined instruction ;)?
Well, there's not much to specialize about STORE_FAST, and STORE_NAME isn't worth it (a for loop at the module level?!). So the super-instruction seems a good idea; specializing it for range() seems good too. You might try to measure how often 'for i in range(…)' actually occurs. I think in my own code I use enumerate() more than range(), but that is harder to specialize (or is it?). It would be a nice win to avoid needing the tuple there (this would need help from the compiler).
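One rough way to do that measurement, sketched with the stdlib ast module (count_range_loops is an illustrative name, not anything from the thread):

```python
# Count how many "for" loops iterate directly over a range(...) call
# in a directory of Python files -- a crude proxy for Guido's question.
import ast
import pathlib

def count_range_loops(root: str = ".") -> tuple[int, int]:
    range_loops = total_loops = 0
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.For, ast.AsyncFor)):
                total_loops += 1
                it = node.iter
                if (isinstance(it, ast.Call)
                        and isinstance(it.func, ast.Name)
                        and it.func.id == "range"):
                    range_loops += 1
    return range_loops, total_loops

print(count_range_loops("Lib"))  # e.g. run against a CPython checkout's stdlib
```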
One possible way to speed up tight loops would be to rearrange the instructions so that the iteration test sits at the bottom of the loop (a single conditional jump back while the iterator keeps yielding values) instead of the current FOR_ITER at the top plus an unconditional JUMP_ABSOLUTE at the bottom. I don't know if this would be profitable, but it might be worth an experiment.
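A rough sketch of the two layouts (my reading of the suggestion above; the FOR_END name is borrowed from later comments in this thread):

```python
# Show today's loop shape: FOR_ITER at the top and JUMP_ABSOLUTE at the
# bottom, so every iteration executes two jump-ish instructions.
import dis

def f(seq):
    for x in seq:
        pass

dis.dis(f)
# Current shape (3.10-ish):            Rearranged shape (sketch):
#      GET_ITER                           GET_ITER
#   >> FOR_ITER       (to exit)           JUMP_FORWARD   (to test)
#      STORE_FAST  x                   >> STORE_FAST  x
#      ...body...                         ...body...
#      JUMP_ABSOLUTE  (to FOR_ITER)    >> test: FOR_END  (back while not exhausted)
#   >> exit:                              exit:          (falls through)
```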
After some hacking around, I think my original approach is pretty naive -- user heeres is right that most of the overhead is from the memory allocation. I'm surprised, though; I had expected pymalloc to make those small allocations cheap. Next, I will attempt to use the specializing interpreter features to apply their optimizations in a safer manner (with full credit given to them, of course). And hopefully, it will have better results.
I'm abandoning this idea. I integrated the PyLong reuse hacks above into a specialized instruction, and it produces a significant pessimization on ranges over small longs [1]. Since there's a small-int cache in CPython, any long from -5 to 256 is essentially free, so the overhead of an adaptive instruction is significant, and I would expect such small ranges to cover most use cases. Additionally, to get the full test suite to pass, I had to be extremely conservative; this means more than half the time it still had to allocate a new long. I don't see this going too far without the compiler's help. There is some speedup on very large ranges, but it was only 10% on a microbench [2]. I'm not sure what settings user heeres was testing with (e.g. debug, release, PGO?) to get 50%, and I might have implemented things wrongly, so I can't provide an explanation. One guess is that pymalloc is quite efficient after PGO. Also, if you're dealing with such large ranges, you're probably better off using numpy, cython, numba, etc. Overall, even if we can fix the other shortcomings, it's not profitable for the complexity it introduces. TL;DR: Combined instructions or Mark's rearranged instructions look like a better fit. Note: all results were obtained on Windows with PGO. Also, I know I should use pyperf instead, but I was short on time 😉. Sorry. [1]
[2]
Commit: Fidget-Spinner/cpython@e5393d6
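For context on the small-int point above, a quick demonstration (a CPython implementation detail, not a language guarantee):

```python
# CPython preallocates the integers -5..256, so producing them in a loop
# costs no allocation at all; anything outside that window is a fresh object.
# int(str) is used here to dodge compile-time constant folding.
assert int("256") is int("256")      # same cached object every time
assert int("257") is not int("257")  # two distinct allocations
```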
Thanks for doing all the research! If you want to combine FOR_ITER and STORE_FAST in the compiler, that could be a super-instruction (#16) added in the assembly phase. I have done a few of these, but never merged any upstream (https://github.com/faster-cpython/cpython/tree/super-instr). Also check out Mark's recent attempt (https://github.com/faster-cpython/cpython/tree/super-instructions).
I'm on a roll producing things with 0 speedups 😉. FOR_ITER + STORE_FAST showed no speedups either. Seems like the existing instructions are already cheap enough that fusing them doesn't buy anything.
Before: …
Superinstruction: …
Commit: Fidget-Spinner/cpython@c8a1338
To close the loop: It seems that Larry tried something similar (though with a different approach) back in 2015: https://bugs.python.org/issue24138. He decided it wasn't worth it in the end. That spawned some attempts at a long freelist in https://bugs.python.org/issue24165. But the pyperformance numbers were a little murky, so that was abandoned too.
Thanks for the links! I wonder if maybe we should look at that integer freelist again.
Once other operations are faster, the overhead of the integer allocation will be a bigger percentage, so maybe it will be noticeable in 3.11 :). Oh, and one more really cool idea by Serhiy, which I think will probably get merged: https://bugs.python.org/issue45026. In short, the PyLong to return is currently calculated by recomputing start + step * index from scratch on every iteration; Serhiy's patch keeps the running value around and just adds the step each time.
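A Python-level model of that change as I understand it (range_iter_old/range_iter_new are illustrative names, not the actual C code):

```python
# Old: each step pays a PyLong multiply plus an add.
def range_iter_old(start, step, length):
    for index in range(length):
        yield start + step * index

# New: keep a running value; each step is a single add.
def range_iter_new(start, step, length):
    value = start
    for _ in range(length):
        yield value
        value += step

assert list(range_iter_old(3, 7, 5)) == list(range_iter_new(3, 7, 5))
```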
I'm reopening this. Specialization of FOR_ITER for generators seems worth doing (see below). A specialization for list iterators might be worthwhile too.
Specialization of generators

Currently, FOR_ITER applied to a generator goes through the full iterator protocol: it calls the generator's tp_iternext, which sets up the generator's frame, makes a fresh call into the interpreter to run it, and reports exhaustion by raising StopIteration. This is a lot of overhead for what is just a link-and-branch. A specialization of FOR_ITER for generators could instead push the generator's frame and resume it inline, much like a specialized call, which should be much faster than the current rigmarole. The only tricky part is returning from the generator, which will need to be handled differently from a normal return.
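A Python-level model of what FOR_ITER effectively does per iteration for a generator today (SENTINEL and for_iter_step are illustrative names, not CPython internals):

```python
# Each iteration re-enters the interpreter through the full iterator
# protocol, including a StopIteration raise at the end:
SENTINEL = object()

def for_iter_step(gen):
    try:
        return next(gen)   # pushes and runs the generator's frame
    except StopIteration:
        return SENTINEL    # translated into "jump past the loop"
```

The proposed specialization would skip this machinery and resume the generator's frame directly, the way a specialized call pushes a frame.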
If we are going to specialize for generators, then we will have to pay the price of PEP 659's adaptive machinery, so we should also specialize for the other common iterator types.
Type classification stats for FOR_ITER show no hits for coroutines and async generators.
Stats gathered with python/cpython#31079.
I tried out a prototype of this here: python/cpython@main...sweeneyde:for_end
Maybe architecturally it really ought to be a peephole optimization (rewriting the JUMP_ABSOLUTE/FOR_ITER pair into a trailing FOR_END), but there is something wrong and it wasn't working.
This should become a single FOR_END at the bottom of the loop that jumps back while the iterator is not exhausted.
FOR_END does fall through in one of the two branches, so its block now needs a fallthrough successor. Is it somehow problematic to take a basicblock that had no fallthrough and then add a fallthrough to it?
Ha, yep. Time for me to go home. It may be an if/else where the if branch ends with a loop:

```python
if condition:
    # This FOR_ITER will be peepholed first, and the jump will be threaded
    # straight to baz()...
    for i in range(42):
        foo()
    # ...then, I *think* your new peephole optimization will stick a
    # FOR_END here that tries to fall through to baz()...
else:
    bar()
    # ...but we already fall through to it here!
baz()
```

If so, then adding an explicit jump instead of relying on fallthrough should fix it.
Hi everyone, my Python implementation did some specialization for FOR_ITER as well. For generator optimization, we had some nifty ideas that we prototyped in our ZipPy VM, which resulted in massive speedups. Some of the details may not directly apply to CPython, since this was done in Truffle/Graal, but there were some interpreter benefits as well AFAIR. This optimization was accepted at OOPSLA'14, and if there is some interest, I can look it up and provide the necessary details for discussion. All the best,
My current attempt at replacing JUMP_ABSOLUTE/FOR_ITER with FOR_END at python/cpython@main...sweeneyde:for_end has a flaw in where it makes tracing happen. For example, I have a failing test case.
The dis output, with the two differing lines labelled:
So either there's some fix to the tracing logic that I'm missing (I'm not super familiar with it), or FOR_ITER needs to stick around and duplicate most of the FOR_END logic, which isn't ideal IMO. Any suggestions?
Perhaps start with the line-tracing logic in ceval.c. This is where the magic happens, so you'll want to start there if you're going to be debugging weird traces. (Looking at it now, maybe that code is the culprit?)
I broke the trace in a similar way with a JUMP_BACK opcode. It turned out that I needed to patch a function in frameobject.c to add the new jump type to its switch. Maybe your problem is there as well.
Even if range() microbenchmarks are hard to budge, the code path for list iterators is a lot smaller, so there's probably a decent speedup possible from specializing there. Dict items and maybe enumerate (the next most common couple of types in pyperformance) could be made to not construct a tuple at all, unpacking directly onto the stack (see the dis sketch below). And then generators could speed up as well, if desired. A FOR_ITER --> FOR_END change seems to make only microscopic improvements at best, but it seems like it could be good to decide whether that's appropriate before trying to specialize. I opened a FOR_END issue at python/cpython#91432 and a PR at python/cpython#70016 that got the tracing bugs ironed out. On the other hand, one wild idea would be to quicken the JUMP_ABSOLUTE at the bottom of a loop into FOR_END at runtime, rather than changing the compiler.
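To see the tuple-then-unpack pattern being discussed, dis makes it visible (output shape varies by version):

```python
# For "for k, v in d.items():", the bytecode is FOR_ITER followed
# immediately by UNPACK_SEQUENCE 2 -- the tuple exists only to be
# torn apart by the very next instruction.
import dis

def f(d):
    for k, v in d.items():
        pass

dis.dis(f)
```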
Just a heads-up: almost all built-in iterators that yield tuples already avoid constructing more than one (they reuse the previous result tuple when nothing else holds a reference, for exactly this reason), so this may not pay off much.
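That reuse is observable from Python (a CPython implementation detail, so treat the assert as illustrative rather than guaranteed):

```python
# enumerate keeps a reference to its last result tuple and refills it
# when nothing else is holding on to it:
it = enumerate("abc")
first = next(it)
addr = id(first)
del first                  # drop our reference so the iterator can recycle it
second = next(it)
assert id(second) == addr  # same tuple object, reused (CPython detail)
```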
With the recent merge of python/cpython#91713, FOR_ITER is now specialized for lists and ranges. Microbenchmarks show ~1.1x speedup for lists, ~1.5x speedup for small ranges, but ~2.9x speedup for larger ranges! That last one comes from modifying the target local in place. It's almost like taking two of @Fidget-Spinner's patches and squishing them together. Pyperformance numbers coming soon.
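For anyone who wants to reproduce the flavor of those numbers, a rough timeit harness (results depend heavily on build flags, version, and hardware):

```python
# Compare the per-loop cost of iterating a list vs. small and larger ranges.
import timeit

print("list :", timeit.timeit("for x in lst: pass",
                              setup="lst = list(range(100))", number=100_000))
print("small:", timeit.timeit("for i in range(100): pass", number=100_000))
print("large:", timeit.timeit("for i in range(10**6, 10**6 + 100): pass",
                              number=100_000))
```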
@sweeneyde, @markshannon: With python/cpython#91713 merged, is this issue finished?
I think this is done.
I haven't properly profiled anything. Throwing this idea here to see what people think.
Consider this fairly common pattern in Python:
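The snippet itself did not survive extraction; presumably it was the canonical counted loop, along these lines (x and do_something are placeholders):

```python
def do_something(i):  # placeholder body
    pass

x = 10
for i in range(x):    # a throwaway range driving a counted loop
    do_something(i)
```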
Bytecode snippet:
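The disassembly also did not survive extraction; it can be regenerated with dis (exact opcodes and offsets vary by version; the comment shows the 3.10-era shape):

```python
import dis

dis.dis("for i in range(x): do_something(i)")
# Roughly (CPython 3.10):
#      LOAD_NAME      range
#      LOAD_NAME      x
#      CALL_FUNCTION  1
#      GET_ITER
#   >> FOR_ITER       (to exit)   <-- calls next() on the range iterator
#      STORE_NAME     i
#      ...body...
#      JUMP_ABSOLUTE  (to FOR_ITER)
#   >> exit:
```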
Anecdotally, some users are surprised at how much overhead this has. For most simple for loops, users intend the range object to be used as the equivalent of a for (int i = 0; i < x; i++) loop in C: the range object is created, then thrown away immediately. In those cases, calling next(range_iterator) in FOR_ITER is unnecessary overhead. We can unbox this into a simple PyLong object, then PyLong_Add on it. This could be a FOR_ITER_RANGE opcode.

This will have to be extremely conservative. We can only implement it for range objects with a reference count of 1 (i.e. used only for the for loop), and they must be the builtin range, not some monkeypatched version. Without the compiler's help, the following would be very dangerous to optimize:
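The example that followed was lost in extraction; here is a reconstruction of the kind of code that breaks both stated assumptions (fake_range and leak are placeholder names):

```python
# 1. The name "range" may not be the builtin at all:
def fake_range(n):
    yield from [10, 20, 30]

range = fake_range      # FOR_ITER_RANGE's assumptions are now wrong

# 2. The range object (and the index) can escape, so other code can
#    observe them -- unboxing or reusing objects in place would be visible:
r = range(5)            # refcount > 1 once "r" exists
leak = []
for i in r:
    leak.append(i)      # "i" escapes; it must be a real, distinct PyLong
```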
We can also do the same with lists and tuples (FOR_ITER_LIST and FOR_ITER_TUPLE), but using the native PyTuple/List_GetItem instead of the iterator protocol. But I'm not sure how common something like for x in (1,2,3) is, so maybe those aren't worth it (they're also much harder to roll back if the optimization breaks halfway).

FOR_ITER isn't a common instruction, but I think when it's used it matters, because it tends to be rather loopy code. What do y'all think?