Add combined opcodes #87850
I'm lining up some PRs (inspired by some of Mark Shannon's ideas) that add new opcodes which are straightforward combinations of existing opcodes. For example, ADD_INT is equivalent to LOAD_CONST + BINARY_ADD, for certain small (common) integer constants. Each of these adds only a minor speedup, but after a dozen or so of these the speedup is (hopefully) significant enough to warrant the opcode churn.
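For readers unfamiliar with the bytecode, here is a quick illustration (mine, not from the original report) of the pair such an opcode would fuse. The exact disassembly depends on the CPython version, and ADD_INT itself is hypothetical:

```python
import dis

def bump(x):
    return x + 1

# On CPython 3.10 and earlier this disassembles to roughly:
#     LOAD_FAST     x
#     LOAD_CONST    1      <- small, common integer constant
#     BINARY_ADD           <- LOAD_CONST + BINARY_ADD would become ADD_INT 1
# (3.11+ emits BINARY_OP 0 (+) instead of BINARY_ADD.)
dis.dis(bump)
```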
We have usually rejected such propositions unless there was evidence of a significant effect on performance. I myself withdrew several of my patches after analyzing statistics. What percentage of opcodes in the compiled bytecode will be ADD_INT? What percentage of executed opcodes will be ADD_INT? Compare it with other opcodes, in particular BINARY_ADD. What are the microbenchmark results?
Microbench results for ADD_INT:

I have a few others like this lined up -- Mark recommended submitting them one at a time to make review easier. My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant. We have reached a point where there are very few individual changes possible that still have a significant effect, but we should not use this as an excuse to stop trying to improve performance.

I have mislaid the results that made me think this particular opcode is a good candidate; I will go back to reproduce that result and get back to you here.
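The numbers themselves were attached to the original comment and are not reproduced here. As a rough sketch of the kind of microbenchmark involved (the loop body and repeat counts below are my own choices, not the ones used above), an add-heavy loop timed with the stdlib timeit module is about the only workload where a faster path for small-constant addition would show up:

```python
# Sketch only: a tight loop dominated by `+ 1`, so a faster bytecode
# path for LOAD_CONST + BINARY_ADD is essentially the only thing that
# can move this number.
import timeit

stmt = """
total = 0
for _ in range(10_000):
    total = total + 1
"""

# Report the best of five runs to reduce noise from the OS scheduler.
print(min(timeit.repeat(stmt, number=1_000, repeat=5)))
```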
""" And sometimes more so than the naïve sum of their individual contributions! Although this was often much clearer on older architectures and compilers (both more predictable). Indeed, in the old days I routinely checked "optimizations" in that _slowed_ microbenchmarks, provided they reduced the work on the critical path, and didn't make the code markedly harder to follow (but reducing the critical path usually makes the code clearer!). Because, sooner or later, compilers will get smart enough to see what I saw, and generate better code. And then these can compound. Like "oh! these three temp variables don't actually need to be materialized at all anymore, and suddenly I have few enough that do need to be materialized that _all_ of them can live in fast registers...". So, at some point, the next seemingly insignificant optimization checked in could _appear_ to have an "inexplicable" benefit, by breaking the bottleneck on _some_ inscrutable HW resource overtaxed on the critical path by the original code. So optimize for what a smart compiler will eventually do, not what they happen to do today ;-) Profile-guided optimization was a major step in that direction. Saddest to me: layers of indirection introduced to support marginal features, because "well, they don't slow down pybench by much at all". That's the opposite: over time, they can compound to do worse damage than the sum of their individual burdens. In the end, Everything Counts™. |
Statically, in the stdlib, about 1 in 1000 opcodes is an ADD_INT candidate. I'll have to do some more work to come up with a dynamic count.
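A rough sketch of how such a static count can be taken (this is my own illustration, not the script behind the 1-in-1000 figure; the BINARY_ADD opname assumes CPython 3.10 or earlier, and the stdlib glob path is a placeholder to adjust for your system):

```python
import dis
import glob

SMALL_INTS = range(-5, 257)  # arbitrary cutoff for "small, common" constants

def count_add_int_candidates(paths):
    """Count LOAD_CONST <small int> immediately followed by BINARY_ADD."""
    total = candidates = 0
    for path in paths:
        try:
            with open(path, encoding="utf-8") as f:
                code = compile(f.read(), path, "exec")
        except (SyntaxError, ValueError, UnicodeDecodeError, OSError):
            continue
        todo = [code]
        while todo:                        # walk nested code objects too
            co = todo.pop()
            todo.extend(c for c in co.co_consts if hasattr(c, "co_code"))
            instrs = list(dis.get_instructions(co))
            total += len(instrs)
            for a, b in zip(instrs, instrs[1:]):
                if (a.opname == "LOAD_CONST" and type(a.argval) is int
                        and a.argval in SMALL_INTS
                        and b.opname == "BINARY_ADD"):
                    candidates += 1
    return candidates, total

cand, total = count_add_int_candidates(glob.glob("/usr/lib/python3.10/*.py"))
print(f"{cand} candidates out of {total} opcodes ({cand / total:.3%})")
```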
I would expect that ADD_INT would typically occur in loops, so even if it is a rare opcode statically, it will really count when it is used.
So dynamically, I see about 0.5% of all opcodes are ADD_INT -- higher than statically (which was 0.1%), but still small enough that I'm abandoning work on ADD_INT.

I've got a branch where I'm adding POP_JUMP_IF_NONE and POP_JUMP_IF_NOT_NONE (which capture if x is None and if x is not None), but the static count there is still pretty low, around 0.2%. These numbers are remarkably stable -- I get 0.2% exactly for mypy, 0.18% for the stdlib. (A bit more in the azure-cli source code: 0.35%.) I expect that the dynamic count here would be a bit higher too: without running it, I'd expect around 1%, if we take it that the average loop does about 5 iterations -- my results for ADD_INT would lead to that conclusion, and this was independently verified by an Instagram blog post from 2017. (https://instagram-engineering.com/profiling-cpython-at-instagram-89d4cbeeb898, search for "loopyness")

Is that enough to persevere? I really don't know. I have a few other potential opcode combinations, RETURN_CONST and RETURN_NONE, but since those (by definition) don't occur in a loop, I'm even less optimistic about the value of those.

We really should optimize things like LOAD_FAST + LOAD_FAST, which occurs much more frequently. However, the way to encode that combination is pretty gross, and I'd rather do that another time.
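For context, here is my own illustration (version dependent) of the sequence that the POP_JUMP_IF_(NOT_)NONE opcodes collapse:

```python
import dis

def f(x):
    if x is None:
        return 0
    return x

# On CPython 3.9/3.10 the test compiles to roughly:
#     LOAD_FAST            x
#     LOAD_CONST           None
#     IS_OP                0
#     POP_JUMP_IF_FALSE    <past the body>
# which the combined opcodes fold into a single POP_JUMP_IF_NOT_NONE
# (or POP_JUMP_IF_NONE for the inverted test).  Those opcodes exist in
# current CPython; the exact names and operands vary by version.
dis.dis(f)
```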
Interesting. What code did you use to collect statistics? I used patches in bpo-27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

We have around 125 opcodes, so 0.5% looks too small. But what is the rank of that opcode? How many opcodes are more frequent? And what is the rank of BINARY_ADD, for comparison?

RETURN_CONST or RETURN_NONE was already discussed before. They can save a few bytes in bytecode, but they are never used in tight loops and the time difference is dwarfed by the frame creation overhead in any case. For frequently used pairs of opcodes we have the PREDICT() macro, which reduces overhead.
For static statistics I wrote my own script: https://github.com/python/cpython/pull/25090/files#diff-994d3592c951c78cbe71084d562590d1507ddfed767e2ec040f5e2610845a11c. I can add that to Tools permanently (I have a slightly improved version that can look for different things).
We have that already: compile with -DDYNAMIC_EXECUTION_PROFILE -DDXPAIRS. There's a script (Tools/scripts/analyze_dxp.py) that works with such data, but it's primitive. Eric is working on improvements; I added my own hacky script to run an app and collect the data. I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance. Eric and I have plans to do this more systematically; we'll then publish our tools and results.
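As a hedged sketch of what consuming that data can look like: sys.getdxp() only exists in interpreters built with those flags, and the pair layout assumed below (counts indexed as profile[previous][next]) is what Tools/scripts/analyze_dxp.py expects, so verify it against your build before trusting the output.

```python
# Sketch only: requires a CPython built with
#   -DDYNAMIC_EXECUTION_PROFILE -DDXPAIRS
# On a normal build sys.getdxp does not exist at all.
import opcode
import sys

def top_pairs(n=20):
    profile = sys.getdxp()              # assumed: list of per-opcode rows
    pairs = []
    for prev, row in enumerate(profile):
        if not isinstance(row, list):   # non-DXPAIRS build: flat counts
            continue
        for nxt, count in enumerate(row):
            if (count and prev < len(opcode.opname)
                    and nxt < len(opcode.opname)):
                pairs.append((count, opcode.opname[prev], opcode.opname[nxt]))
    for count, a, b in sorted(pairs, reverse=True)[:n]:
        print(f"{count:>12}  {a} -> {b}")

# Run the workload of interest first (import and exercise the app),
# then call top_pairs() to see the most frequent opcode pairs.
```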
I'm sure you've considered this, but I'd be *very* careful using opcode stats from the official benchmarks to inform work on these improvements. It's probably better to stick to in-sample data from the stdlib / mypy / etc. and use the benchmarks *only* for measuring out-of-sample results. Otherwise we may just end up hacking the benchmark suite. :)
Yeah, I've received many warnings about benchmark hacking already, and we
For static statistics, it's easy enough to either download a thousand of
For dynamic statistics, the problem is always to come up with a
Running test suites (whether it's the stdlib tests or some other set of
I wish it was the case that static stats can be a good proxy for dynamic
We now have POP_JUMP_IF_NONE and LOAD_FAST__LOAD_FAST. So most of the ideas here have been implemented, and a few of the others were rejected. If nobody objects, I will close this as completed.