
Add combined opcodes #87850


Closed

gvanrossum opened this issue Mar 31, 2021 · 12 comments
Labels
3.10 (only security fixes) · pending (the issue will be closed if no feedback is provided) · performance (performance or resource usage)

Comments

@gvanrossum
Member

BPO 43684
Nosy @gvanrossum, @tim-one, @rhettinger, @terryjreedy, @gpshead, @markshannon, @ericsnowcurrently, @serhiy-storchaka, @pablogsal, @brandtbucher, @isidentical
PRs
  • bpo-43684: Add ADD_INT opcode #25090
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-03-31.22:40:32.519>
    labels = ['3.10', 'performance']
    title = 'Add combined opcodes'
    updated_at = <Date 2021-04-06.21:41:49.532>
    user = 'https://github.com/gvanrossum'

    bugs.python.org fields:

    activity = <Date 2021-04-06.21:41:49.532>
    actor = 'gvanrossum'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2021-03-31.22:40:32.519>
    creator = 'gvanrossum'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43684
    keywords = ['patch']
    message_count = 11.0
    messages = ['389939', '389967', '390101', '390104', '390105', '390109', '390263', '390292', '390350', '390373', '390378']
    nosy_count = 11.0
    nosy_names = ['gvanrossum', 'tim.peters', 'rhettinger', 'terry.reedy', 'gregory.p.smith', 'Mark.Shannon', 'eric.snow', 'serhiy.storchaka', 'pablogsal', 'brandtbucher', 'BTaskaya']
    pr_nums = ['25090']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue43684'
    versions = ['Python 3.10']

    @gvanrossum
    Member Author

    I'm lining up some PRs (inspired by some of Mark Shannon's ideas) that add new opcodes which are straightforward combinations of existing opcodes. For example, ADD_INT is equivalent to LOAD_CONST + BINARY_ADD, for certain small (common) integer constants.

    Each of these adds only a minor speedup, but after a dozen or so of these the speedup is (hopefully) significant enough to warrant the opcode churn.
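    A minimal illustration of the pair such an opcode would fold, assuming CPython 3.10 bytecode (the constant 1 stands in for any small, common integer):

    ```python
    import dis

    # Under CPython 3.10, `x + 1` compiles to LOAD_CONST followed by BINARY_ADD.
    # A hypothetical ADD_INT would fold that pair into a single instruction
    # whose oparg encodes the small integer constant.
    dis.dis(compile("x + 1", "<demo>", "eval"))
    # Expected output on 3.10, roughly:
    #   1       0 LOAD_NAME      0 (x)
    #           2 LOAD_CONST     0 (1)
    #           4 BINARY_ADD
    #           6 RETURN_VALUE
    ```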

    @gvanrossum gvanrossum added the 3.10 (only security fixes) and performance (performance or resource usage) labels Mar 31, 2021
    @serhiy-storchaka
    Member

    We have usually rejected such proposals unless there was evidence of a significant effect on performance. I myself withdrew several of my patches after analyzing statistics.

    What percentage of opcodes in the compiled bytecode will be ADD_INT? What percentage of executed opcodes will be ADD_INT? Compare it with other opcodes, in particular BINARY_ADD. What are the microbenchmark results?

    @gvanrossum
    Member Author

    Microbench results for ADD_INT:
    #25090 (comment)

    I have a few others like this lined up -- Mark recommended submitting them one at a time to make review easier.

    My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant.

    We have reached a point where there are very few individual changes possible that still have a significant effect, but we should not use this as an excuse to stop trying to improve performance.

    I have mislaid the results that made me think this particular opcode is a good candidate; I will go back to reproduce that result and get back to you here.

    @tim-one
    Member

    tim-one commented Apr 2, 2021

    """
    My philosophy here (which I learned from Tim Peters in the early 2000s) is that even though each individual improvement has no measurable effect on a general benchmark (as shown in the same comment), the combined effect of a number of tiny improvements can be significant.
    """

    And sometimes more so than the naïve sum of their individual contributions! Although this was often much clearer on older architectures and compilers (both more predictable).

    Indeed, in the old days I routinely checked in "optimizations" that _slowed_ microbenchmarks, provided they reduced the work on the critical path and didn't make the code markedly harder to follow (but reducing the critical path usually makes the code clearer!).

    Because, sooner or later, compilers will get smart enough to see what I saw, and generate better code. And then these can compound. Like "oh! these three temp variables don't actually need to be materialized at all anymore, and suddenly I have few enough that do need to be materialized that _all_ of them can live in fast registers...". So, at some point, the next seemingly insignificant optimization checked in could _appear_ to have an "inexplicable" benefit, by breaking the bottleneck on _some_ inscrutable HW resource overtaxed on the critical path by the original code.

    So optimize for what a smart compiler will eventually do, not what they happen to do today ;-) Profile-guided optimization was a major step in that direction.

    Saddest to me: layers of indirection introduced to support marginal features, because "well, they don't slow down pybench by much at all". That's the opposite: over time, they can compound to do worse damage than the sum of their individual burdens.

    In the end, Everything Counts™.

    @gvanrossum
    Member Author

    Statically, in the stdlib, about 1 in 1000 opcodes is an ADD_INT candidate.

    I'll have to do some more work to come up with a dynamic count.
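    A rough sketch of how such a static count can be gathered with the dis module; this is not the script used for the PR, and the "small constant" threshold and the stdlib file chosen are assumptions:

    ```python
    import dis

    def add_int_candidates(code):
        """Count opcodes and LOAD_CONST <small int> + BINARY_ADD pairs,
        recursing into nested code objects."""
        total = candidates = 0
        prev = None
        for instr in dis.get_instructions(code):
            total += 1
            if (instr.opname == "BINARY_ADD"
                    and prev is not None
                    and prev.opname == "LOAD_CONST"
                    and type(prev.argval) is int
                    and 0 <= prev.argval < 256):   # "small" threshold is a guess
                candidates += 1
            prev = instr
        for const in code.co_consts:
            if hasattr(const, "co_code"):          # nested functions, comprehensions, ...
                t, c = add_int_candidates(const)
                total += t
                candidates += c
        return total, candidates

    # Path assumes a CPython checkout; any .py file works.
    src = open("Lib/dataclasses.py").read()
    total, cand = add_int_candidates(compile(src, "dataclasses.py", "exec"))
    print(f"{cand}/{total} opcodes ({cand / total:.2%}) are ADD_INT candidates")
    ```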

    @rhettinger
    Contributor

    Statically, in the stdlib, about 1 in 1000 opcodes is an ADD_INT candidate.

    I would expect that ADD_INT would typically occur in loops, so even if it is a rare opcode statically, it will really count when it is used.

    @gvanrossum
    Member Author

    So dynamically, I see about 0.5% of all opcodes are ADD_INT, so higher than statically (which was 0.1%) but still small enough that I'm abandoning work on ADD_INT.

    I've got a branch where I'm adding POP_JUMP_IF_NONE and POP_JUMP_IF_NOT_NONE (which capture if x is None and if x is not None), but the static count there is still pretty low, around 0.2%. These numbers are remarkably stable -- I get exactly 0.2% for mypy and 0.18% for the stdlib. (A bit more in the azure-cli source code: 0.35%.)
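    A minimal illustration of the pattern those opcodes target, assuming CPython 3.10 bytecode:

    ```python
    import dis

    def f(x):
        if x is None:
            return 0
        return 1

    # Under 3.10 the test compiles to LOAD_FAST, LOAD_CONST (None), IS_OP,
    # POP_JUMP_IF_FALSE; a POP_JUMP_IF_NONE opcode would fold the last three
    # instructions into one.
    dis.dis(f)
    ```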

    I expect that the dynamic count here would be a bit higher too: without running it, I'd expect around 1%, if we take it that the average loop does about 5 iterations -- my results for ADD_INT would lead to that conclusion, and this was independently verified by an Instagram blog post from 2017. (https://instagram-engineering.com/profiling-cpython-at-instagram-89d4cbeeb898, search for "loopyness")

    Is that enough to persevere? I really don't know.

    I have a few other potential opcode combinations, RETURN_CONST and RETURN_NONE, but since those (by definition) don't occur in a loop, I'm even less optimistic about the value of those.
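    For comparison, a sketch of the two-instruction sequence a RETURN_NONE opcode would merge (again assuming 3.10 bytecode):

    ```python
    import dis

    def g():
        pass

    # A function that falls off the end (or executes a bare `return`) compiles
    # to LOAD_CONST (None) followed by RETURN_VALUE; RETURN_NONE would replace
    # the pair, and RETURN_CONST would generalize it to any constant.
    dis.dis(g)
    ```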

    We really should optimize things like LOAD_FAST + LOAD_FAST, which occurs much more frequently. However, the way to encode that combination is pretty gross, and I'd rather do that another time.

    @serhiy-storchaka
    Member

    Interesting. What code did you use to collect statistics? I used patches in bpo-27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

    We have around 125 opcodes, so 0.5% looks too small. But what is the rank of that opcode? How many opcodes are more frequent? And what is the rank of BINARY_ADD, for comparison?

    RETURN_CONST and RETURN_NONE have been discussed before. They can save a few bytes in bytecode, but they are never used in tight loops, and the time difference is dwarfed by the frame creation overhead in any case.

    For frequently used pairs of opcodes we have the PREDICT() macro, which reduces overhead.

    @gvanrossum
    Member Author

    Interesting. What code did you use to collect statistics?

    For static statistics I wrote my own script: https://github.com/python/cpython/pull/25090/files#diff-994d3592c951c78cbe71084d562590d1507ddfed767e2ec040f5e2610845a11c. I can add that to Tools permanently (I have a slightly improved version that can look for different things).

    I used patches in bpo-27255. Perhaps it is worth adding optionally compiled code for collecting dynamic opcode statistics permanently.

    We have that already: compile with -DDYNAMIC_EXECUTION_PROFILE -DDXPAIRS. There's a script (Tools/scripts/analyze_dxp.py) that works with such data, but it's primitive. Eric is working on improvements; I added my own hacky script to run an app and collect the data.
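    A minimal sketch of reading that data out of such a build, based on the layout that Tools/scripts/analyze_dxp.py expects; the workload line is a placeholder, and sys.getdxp() does not exist in regular builds:

    ```python
    import dis
    import sys

    # Requires an interpreter compiled with -DDYNAMIC_EXECUTION_PROFILE -DDXPAIRS.
    sum(i + 1 for i in range(100_000))       # placeholder workload to profile

    profile = sys.getdxp()
    # With DXPAIRS, the last entry holds single-opcode totals; the preceding rows
    # form a matrix of pair counts indexed as profile[previous_opcode][next_opcode].
    pairs = sorted(
        ((count, prev, nxt)
         for prev, row in enumerate(profile[:-1])
         for nxt, count in enumerate(row)
         if count),
        reverse=True,
    )
    for count, prev, nxt in pairs[:10]:
        print(f"{count:10d}  {dis.opname[prev]} -> {dis.opname[nxt]}")
    ```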

    I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance. Eric and I have plans to do this more systematically; we'll then publish our tools and results.

    @brandtbucher
    Member

    I collected a few different sets of statistics: static stats for the stdlib and for mypy, dynamic stats for running mypy and a few of the benchmarks in pyperformance.

    I'm sure you've considered this, but I'd be *very* careful using opcode stats from the official benchmarks to inform work on these improvements. It's probably better to stick to in-sample data from the stdlib / mypy / etc. and use the benchmarks *only* for measuring out-of-sample results.

    Otherwise we may just end up hacking the benchmark suite. :)

    @gvanrossum
    Member Author

    Yeah, I've received many warnings about benchmark hacking already, and we won't fall for that.

    For static statistics, it's easy enough to download a thousand of the most popular projects from GitHub or PyPI and just try to compile all the .py files found there.

    For dynamic statistics, the problem is always to come up with a representative run. I know how to do that for mypy (just running mypy over itself should be pretty good), but we can't do that for random projects from PyPI or GitHub -- not only do I not want to run 3rd party code I haven't reviewed carefully first, but usually there's no way to know how to even run it -- configuration, environment, input data all need to be fixed, and dependencies need to be installed.

    Running test suites (whether it's the stdlib tests or some other set of unit tests for some other code base) is likely to give skewed results, given how people often write tests (trying to hit each line of code at least once but as few times as possible so the tests run quickly). This is where benchmarks provide some limited value, since they come with a standard way to run non-interactively.

    I wish it were the case that static stats could be a good proxy for dynamic stats, since then we could just compute static stats for a large amount of 3rd party code, but I'm not too sure of that. So I would like to have more examples that we can measure both statically and dynamically -- if the dynamic numbers look sufficiently similar to the static numbers, we can see that as an indication that static numbers are a good indicator, and if they look different, it means that we're going to have to collect more examples in order to make progress.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel
    Member

    We now have POP_JUMP_IF_NONE and LOAD_FAST__LOAD_FAST. So most of the ideas here are implemented, and a few of the others were rejected.

    If nobody objects, I will close this as completed.

    @iritkatriel iritkatriel added the pending (the issue will be closed if no feedback is provided) label Sep 10, 2022