Skip to content

bpo-46406: Faster single digit int division. #30626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 23, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
The integer division ``//`` implementation has been optimized to better let the
compiler understand its constraints. It can be 20% faster on the amd64 platform
when dividing an int by a value smaller than ``2**30``.
36 changes: 26 additions & 10 deletions Objects/longobject.c
Original file line number Diff line number Diff line change
Expand Up @@ -1616,25 +1616,41 @@ v_rshift(digit *z, digit *a, Py_ssize_t m, int d)
in pout, and returning the remainder. pin and pout point at the LSD.
It's OK for pin == pout on entry, which saves oodles of mallocs/frees in
_PyLong_Format, but that should be done with great care since ints are
immutable. */
immutable.

This version of the code can be 20% faster than the pre-2022 version
on todays compilers on architectures like amd64. It evolved from Mark
Dickinson observing that a 128:64 divide instruction was always being
generated by the compiler despite us working with 30-bit digit values.
See the thread for full context:

https://mail.python.org/archives/list/[email protected]/thread/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/#NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5

If you ever want to change this code, pay attention to performance using
different compilers, optimization levels, and cpu architectures. Beware of
PGO/FDO builds doing value specialization such as a fast path for //10. :)

Verify that 17 isn't specialized and this works as a quick test:
python -m timeit -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserting that division by 10 "works" is irrelevant when the test is actually dividing by 17. But it's a timing statement, so asserting that the division works is beside the point anyway. The point is to get the timing, so

python -m timeit -s 'x = 10**1000' 'x//17'

would be best. And 17 isn't of real interest anyway. Use 10 instead. That's the value we've actually seen special-cased in PGO builds. 17 only came up when Mark temporarily fudged a PGO-build input test to pick on 17 instead of 10, just to see what would happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command is just a paste of what i was using in my terminal as a quick unittest and microbenchmark. The assert in the setup code is a unittest. It's just easy to write a concise test using 10 and what value is used there isn't really relevant to what is used in the x//17 microbenchmark expression itself.

*/
static digit
inplace_divrem1(digit *pout, digit *pin, Py_ssize_t size, digit n)
{
twodigits rem = 0;
digit remainder = 0;

assert(n > 0 && n <= PyLong_MASK);
pin += size;
pout += size;
while (--size >= 0) {
digit hi;
rem = (rem << PyLong_SHIFT) | *--pin;
*--pout = hi = (digit)(rem / n);
rem -= (twodigits)hi * n;
}
return (digit)rem;
twodigits dividend;
dividend = ((twodigits)remainder << PyLong_SHIFT) | pin[size];
digit quotient;
quotient = (digit)(dividend / n);
remainder = dividend % n;
pout[size] = quotient;
}
return remainder;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patterns like *--pin and *--pout are ubiquitous in this file. so converting that style to pin[size] (etc) is jarring in context. The patch would be better if it stuck to changing what needed to be changed.

It would also be good to add a comment explaining why computing both "/" and "%" is faster on most boxes now than doing what the code originally did. "/" and "%" are both very expensive if they're done in isolation, which is presumably why the original code did a "*" and "-" instead of "%".

Copy link
Member Author

@gpshead gpshead Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to add the missing comment to this effect while you're in this file in your #30856.

[] vs pointer arithmetic would make me want to rerun benchmarks and examine generated code out of curiosity so i'll leave that alone for now - it might be worth investigating and would be consistent. I just took what Mark had written on python-dev and ran with it.

}


/* Divide an integer by a digit, returning both the quotient
(as function result) and the remainder (through *prem).
The sign of a is ignored; n should not be zero. */
Expand Down