-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
bpo-46406: Faster single digit int division. #30626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
The integer division ``//`` implementation has been optimized to better let the | ||
compiler understand its constraints. It can be 20% faster on the amd64 platform | ||
when dividing an int by a value smaller than ``2**30``. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1616,25 +1616,41 @@ v_rshift(digit *z, digit *a, Py_ssize_t m, int d) | |
in pout, and returning the remainder. pin and pout point at the LSD. | ||
It's OK for pin == pout on entry, which saves oodles of mallocs/frees in | ||
_PyLong_Format, but that should be done with great care since ints are | ||
immutable. */ | ||
immutable. | ||
|
||
This version of the code can be 20% faster than the pre-2022 version | ||
on todays compilers on architectures like amd64. It evolved from Mark | ||
Dickinson observing that a 128:64 divide instruction was always being | ||
generated by the compiler despite us working with 30-bit digit values. | ||
See the thread for full context: | ||
|
||
https://mail.python.org/archives/list/[email protected]/thread/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/#NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5 | ||
|
||
If you ever want to change this code, pay attention to performance using | ||
different compilers, optimization levels, and cpu architectures. Beware of | ||
PGO/FDO builds doing value specialization such as a fast path for //10. :) | ||
|
||
Verify that 17 isn't specialized and this works as a quick test: | ||
python -m timeit -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17' | ||
*/ | ||
static digit | ||
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n) | ||
{ | ||
twodigits rem = 0; | ||
digit remainder = 0; | ||
|
||
assert(n > 0 && n <= PyLong_MASK); | ||
pin += size; | ||
pout += size; | ||
while (--size >= 0) { | ||
digit hi; | ||
rem = (rem << PyLong_SHIFT) | *--pin; | ||
*--pout = hi = (digit)(rem / n); | ||
rem -= (twodigits)hi * n; | ||
} | ||
return (digit)rem; | ||
twodigits dividend; | ||
dividend = ((twodigits)remainder << PyLong_SHIFT) | pin[size]; | ||
digit quotient; | ||
quotient = (digit)(dividend / n); | ||
remainder = dividend % n; | ||
pout[size] = quotient; | ||
} | ||
return remainder; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Patterns like It would also be good to add a comment explaining why computing both "/" and "%" is faster on most boxes now than doing what the code originally did. "/" and "%" are both very expensive if they're done in isolation, which is presumably why the original code did a "*" and "-" instead of "%". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Feel free to add the missing comment to this effect while you're in this file in your #30856. [] vs pointer arithmetic would make me want to rerun benchmarks and examine generated code out of curiosity so i'll leave that alone for now - it might be worth investigating and would be consistent. I just took what Mark had written on python-dev and ran with it. |
||
} | ||
|
||
|
||
/* Divide an integer by a digit, returning both the quotient | ||
(as function result) and the remainder (through *prem). | ||
The sign of a is ignored; n should not be zero. */ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserting that division by 10 "works" is irrelevant when the test is actually dividing by 17. But it's a timing statement, so asserting that the division works is beside the point anyway. The point is to get the timing, so
would be best. And 17 isn't of real interest anyway. Use 10 instead. That's the value we've actually seen special-cased in PGO builds. 17 only came up when Mark temporarily fudged a PGO-build input test to pick on 17 instead of 10, just to see what would happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command is just a paste of what i was using in my terminal as a quick unittest and microbenchmark. The assert in the setup code is a unittest. It's just easy to write a concise test using 10 and what value is used there isn't really relevant to what is used in the
x//17
microbenchmark expression itself.