bpo-45367: Specialize BINARY_MULTIPLY #28727
Conversation
🤖 New build scheduled with the buildbot fleet by @sweeneyde for commit 300a0ca 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
Looks promising. Apologies for the merge conflicts from #28723.
Thanks Dennis, this looks good and I think it's almost ready to merge :). Just needs some stable benchmarks on non-Windows platforms.
```c
    goto success;
}
else {
    SPECIALIZATION_FAIL(BINARY_MULTIPLY, SPEC_FAIL_OTHER);
```
Maybe this should be refined further. In another PR, though.
I agree. I don't think it will touch pyperformance, but it seems that in our own test suite almost 10% aren't hits #28727 (comment).
This probably varies wildly; I'd imagine test_string has lots of `str * num` too.
Oh gosh, I just realized I'd read the entire PR as `BINARY_ADD`, not `BINARY_MULTIPLY`. Please ignore my delusional ramblings.
Most of the non-hits are deferred, not specialization failures, so I think that's because a lot of test code is, by nature, run only once, without many tight loops.
Thanks for pointing that out Dennis. I can't believe I missed that :). That means nearly 100% actual specialization on hot code. Hooray!
Off-topic: I've recently wondered if pyperformance is somewhat out of touch (no offence intended to its contributors). Many of its benchmarks (nbody, nqueens, etc.) have been found by the JS folks to not be realistic. The Pyston benchmark suite seems way more comprehensive.
"deferred" means that the ADAPTIVE
instruction is counting down until its next specialization attempt.
This only happens after a specialization failure.
Look at the numbers. deferred ≈ sum(specialization-failures) * ADAPTIVE_CACHE_BACKOFF
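As a minimal sketch of the back-off being described (the names, struct layout, and constant value are assumptions for illustration, not the exact CPython source), the countdown behaves roughly like this:

```c
/* Minimal sketch of the adaptive back-off described above; names,
 * layout, and the constant are assumptions, not CPython's exact code. */
#define ADAPTIVE_CACHE_BACKOFF 64  /* assumed value, for illustration */

typedef struct {
    unsigned int counter;
} AdaptiveCache;

/* Called each time the ADAPTIVE instruction executes. */
static int
adaptive_should_specialize(AdaptiveCache *cache)
{
    if (cache->counter > 0) {
        cache->counter--;   /* this execution is counted as "deferred" */
        return 0;
    }
    return 1;               /* counter hit zero: try specializing again */
}

/* Called when a specialization attempt fails; it restarts the countdown,
 * which is why deferred ≈ sum(failures) * ADAPTIVE_CACHE_BACKOFF. */
static void
adaptive_record_failure(AdaptiveCache *cache)
{
    cache->counter = ADAPTIVE_CACHE_BACKOFF;
}
```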
It does show speedups for the benchmarks we would expect, but there are significant slowdowns for some other benchmarks; notably the pickle benchmarks.
I've personally disregarded anything from pyperformance in the range of ±1.05x. That said, I wholly believe nbody sped up; I just don't trust pickle slowing down. There's no …
I'm guessing specializing int/float mixtures would be beneficial. I wonder if it's better to have a single slightly-more-complex opcode or add more opcodes. I was thinking something like:

```c
TARGET(BINARY_MULTIPLY_FLOAT) {
    PyObject *left = SECOND();
    PyObject *right = TOP();
    double dleft, dright;
    if (PyFloat_CheckExact(left)) {
        dleft = ((PyFloatObject *)left)->ob_fval;
        if (PyFloat_CheckExact(right)) {
            dright = ((PyFloatObject *)right)->ob_fval;
        }
        else if (PyLong_CheckExact(right) && IS_MEDIUM_VALUE(right)) {
            dright = (double)medium_value(right);
        }
        else {
            DEOPT_IF(1, BINARY_MULTIPLY);
        }
    }
    else if (PyLong_CheckExact(left) && IS_MEDIUM_VALUE(left)) {
        DEOPT_IF(!PyFloat_CheckExact(right), BINARY_MULTIPLY);
        dleft = (double)medium_value(left);
        dright = ((PyFloatObject *)right)->ob_fval;
    }
    else {
        DEOPT_IF(1, BINARY_MULTIPLY);
    }
    STAT_INC(BINARY_MULTIPLY, hit);
    record_hit_inline(next_instr, oparg);
    PyObject *prod = PyFloat_FromDouble(dleft * dright);
    SET_SECOND(prod);
    Py_DECREF(right);
    Py_DECREF(left);
    STACK_SHRINK(1);
    if (prod == NULL) {
        goto error;
    }
    DISPATCH();
}
```
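For context, here is a rough sketch of the semantics assumed for the `IS_MEDIUM_VALUE` and `medium_value` helpers used above (the real definitions live in CPython's Objects/longobject.c and may differ): a "medium" int is one whose magnitude fits in a single 30-bit digit.

```c
#include <Python.h>
#include <longintrepr.h>  /* PyLongObject layout; header location varies by version */

/* Hedged sketch only: assumed semantics for the medium-int helpers,
 * not CPython's exact definitions. A "medium" int has Py_SIZE(op)
 * of -1, 0, or 1, i.e. it uses at most one digit. */
static inline int
is_medium_value_sketch(PyObject *op)
{
    Py_ssize_t size = Py_SIZE(op);
    return -1 <= size && size <= 1;
}

static inline long
medium_value_sketch(PyObject *op)
{
    /* ob_size carries the sign; when it is 0 the product is 0. */
    return (long)Py_SIZE(op) * (long)((PyLongObject *)op)->ob_digit[0];
}
```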
The most recent change gets the results of …
What types are you specializing for, and why? Please read https://github.com/python/cpython/blob/main/Python/adaptive.md, specifically https://github.com/python/cpython/blob/main/Python/adaptive.md#gathering-data
My thinking was that two …

To clarify, are you advising that the increased complexity of the opcode is not worth the increased percentage of instructions specialized? Is the solution to …
I am struggling to run the whole pyperformance suite at once, but as an example, bm_spectral_norm goes from … to … when incorporating …

I suppose we could gather specialization stats for what happens if …
nbody sped up, everything else looks in the realm of noise for pyperformance: https://gist.github.com/Fidget-Spinner/6fd3149fc82497d028b02046765ba8d8
Yes.
I don't think there is sufficient evidence that these are useful. At least, not yet.
The deferred operations only represent about 1%. Not worth worrying about.
In general, it is not the number of branches that matters so much as their predictability.
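To make that concrete, here is a small stand-alone illustration (not from this PR): both loops below execute the same number of branches, but only the second is unpredictable, so it is the one that pays for mispredictions.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 1 << 24 };
    long sum = 0;

    /* Predictable: the condition is true every iteration, so the
       branch predictor learns it almost immediately. */
    for (int i = 0; i < N; i++) {
        if (i >= 0) {
            sum++;
        }
    }

    /* Unpredictable: rand() & 1 flips essentially at random, so
       roughly half the iterations suffer a misprediction. Timing the
       two loops separately shows the gap despite equal branch counts. */
    for (int i = 0; i < N; i++) {
        if (rand() & 1) {
            sum++;
        }
    }

    printf("%ld\n", sum);
    return 0;
}
```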
I am beginning to think this PR may not be worth it at all: even on microbenchmarks, after PGO, there's not that much benefit.

Program:

```python
from pyperf import Runner

runner = Runner()
runner.timeit(
    "int: x*x",
    setup="from itertools import repeat; it = repeat(1, 1_000_000)",
    stmt="for x in it: x*x")
runner.timeit(
    "float: x*x",
    setup="from itertools import repeat; it = repeat(1.0, 1_000_000)",
    stmt="for x in it: x*x")
runner.timeit(
    "int: x*...*x",
    setup="from itertools import repeat; it = repeat(1, 1_000_000)",
    stmt="for x in it: x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x")
runner.timeit(
    "float: x*...*x",
    setup="from itertools import repeat; it = repeat(1.0, 1_000_000)",
    stmt="for x in it: x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x*x")
```

Results from PGO on WSL: …
Results from PGO using build.bat (MSVC): …
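(If it helps with reproducing numbers like these: a pyperf script such as the one above can be run with `python script.py -o results.json`, and two builds can then be compared with `python -m pyperf compare_to base.json patched.json`. The script name here is just a placeholder.)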
@sweeneyde The speedups are worthwhile for less than 100 additional lines of code, IMO. The speedups on the microbenchmark show that it works. Don't forget that …
Okay, that makes sense, thanks. Re-opened.
🤖 New build scheduled with the buildbot fleet by @markshannon for commit 31c3bb9 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
Buildbot failures seem to be the usual suspects.
https://bugs.python.org/issue45367