Measure Windows performance (and improve if lacking) #321
Would it be a viable option to use Clang instead of MSVC for Windows? |
I have got it compiling under Clang on Windows in a debug build. |
FWIW my own measurements on Windows show much better perf than that, but are inconclusive due to noise (we're waiting for a dedicated perf machine running Windows). See #315 (comment) . |
@zooba Do you see a problem with using clang instead of MSVC on Windows, assuming we can get it working? |
Wouldn't that mean that anyone building C extensions would also have to switch over to Clang? If so, that would require some coordination (e.g. cibuildwheel, conda-forge, numpy/scipy) to make such a switch and support two compilers if you want to make C extensions available for multiple Python versions. |
I don't know. I have no experience with this topic at all, which is why I'm asking @zooba for advice. @kumaraditya303 Can you point to your branch here? |
Yeah, that's a massive ecosystem change that we're not ready for. It almost certainly will affect the ABI in some way, though it might be manageable, and it'll definitely impact/break any build scripts that implicitly or explicitly assume MSVC. As I work through ARM64 support, I'm finding that is "many"... (unless, of course, it turns out to be perfectly safe to link between the two compilers, and then it'll only affect static lib users, which is practically nobody AFAIK). Figuring out for certain if the size of ceval is causing it to de-optimise and then sensibly breaking it up is much easier. |
FWIW @neonene came up with a PR that temporarily enables a flag that shows the inlining decisions. I'll look into this. |
I have a branch python/cpython@main...sweeneyde:ops_to_funcs that converts several of the longer opcodes into functions, in case that is useful for comparison. |
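For context on what converting an opcode into a function looks like, here is a minimal made-up sketch (the helper name and signature are invented, not taken from that branch):

```c
#include <Python.h>

/* Hypothetical sketch: the body of a long opcode (something shaped like
   BUILD_LIST) is pulled out of the giant switch into its own function,
   so the case in the eval loop shrinks to a call plus an error check. */
static PyObject *
build_list_op(PyObject **stack, Py_ssize_t oparg)
{
    PyObject *list = PyList_New(oparg);
    if (list == NULL) {
        return NULL;
    }
    while (--oparg >= 0) {
        /* PyList_SET_ITEM steals the references already held on the value stack. */
        PyList_SET_ITEM(list, oparg, stack[oparg]);
    }
    return list;
}
```

The point of such an experiment is to shrink the main eval function and see whether its sheer size is what makes MSVC de-optimise it.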
The macro benchmarks' improvements are unstable for me, in the 8~20% range compared to 3.10.1+. When building with PGO (only), changing the definition of |
Regarding Sample patch:
x64 PGO details
Release builds:
It is possible that another commit reverses which build is slower or faster.
|
Release vs PGO, leaving
|
3.10.1 (x64) could show a 17% gap between release (/Ob3) and PGO builds.
|
I've figured out how to get assembly code from after the PGO/LTO passes run, but the output is hard to understand -- there are almost no symbols in the resulting file and absolutely no line numbers. I have to do this for python311.dll and the output is 630,000 lines. The function Anyway, the key thing I learned here is
I was hoping that looking at that disassembly would answer questions like "did the switch get compiled to the optimal form" but I can barely even find where the switch starts. :-( I found one trick that seems to help:
Using search on the raw disassembly I was able to confirm that the code there looked the same. I now believe that the top of the switch compiles to this code (on my machine, with VS 2019, in PGO+LTO mode):
(Which is weird, because that last line is not directly following the switch.) Unfortunately, I don't know x86/x64 assembly, so I have no idea what this means. |
Okay, the Disassembly window works a bit differently than I had understood before. It does not necessarily display all source lines, and/or it doesn't display them in order. I guess that's why it has to be so deeply integrated with the debugger. A more illustrative snippet:
That corresponds to this source code:
(I'm not sure why the Tracing through the Also note that I did all this in the PR branch where I am experimenting with turning INCREF/DECREF back into macros, in ceval.c only. |
@neonene I am not ignoring you, far from it! I would like to dig into the There are only two calls to it in ceval.c. How can we prove that the optimizer really does get distracted by those? Note that the two are not similar: One call is here:
which is only unreachable because the bytecode compiler never generates this opcode -- but the optimizer cannot know that (though PGO can deduce it, since it's never executed). The other is here:
Here the optimizer should know the code is really unreachable, if all cases indeed do end with Can we use this knowledge somehow? Maybe manually put calls to PS. Can you link us to your branch? |
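For readers following along: the macro being discussed is elided above, but an "unreachable" hint is typically spelled differently per compiler, roughly like this (a generic sketch, not the actual definition used in ceval.c):

```c
#include <stdlib.h>

/* Generic illustration of an "unreachable" hint; the macro name is
   invented and this is not CPython's definition. */
#if defined(_MSC_VER)
#  define UNREACHABLE_HINT() __assume(0)            /* MSVC intrinsic */
#elif defined(__GNUC__) || defined(__clang__)
#  define UNREACHABLE_HINT() __builtin_unreachable()
#else
#  define UNREACHABLE_HINT() abort()                /* conservative fallback */
#endif
```

Whether MSVC's optimizer actually uses such a hint to prune or reorder the switch is exactly the open question in the comment above.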
Could anyone run benchmarks on the same version (timing) without code change, putting my posts aside? My English reading/writing is too slow to reply about Comparing Release with PGO should be easier than 3.10 PGO vs 3.11 PGO.
|
Any commit is fine. I don't mean the same version as mine. |
@neonene I still don't have the capacity to run benchmarks without a lot of noise. The rest here is lab notes: I learned a bit of x64 assembler and I think I have a better grip on how
Let me narrate that. (Register names like EAX are aliases for the lower 32 bits of RAX -- the latter is a 64-bit register.) At the start, register EDI contains the opcode and ESI has the oparg (the result of the
So we have two tables, call them
The critical point here is that we have two memory loads, where the second one is dependent on the first one:
So this is very far from the "computed goto" we have with GCC and Clang, nor is it anywhere near the "use opcode as jump index" we were hoping for (that would be using Given enough time and monkeys I can now answer any question about optimization. :-) Another observation relates to Next I'm going to see what things look like on the main branch (so far I used my "inline" branch from python/cpython#32387). |
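To make the narration concrete, the dispatch pattern described above amounts to something like the following pseudocode. The table names are invented (the thread nicknames the two compiler-generated tables "offset_table" and "compact_table"), and `goto *x` is GCC computed-goto syntax standing in for the indirect jump MSVC emits:

```c
/* Illustration only, not compiler output. */
extern const unsigned char case_index_table[256];  /* opcode -> dense case number */
extern void *const jump_target_table[];            /* case number -> code address */

/* One dispatch: two dependent memory loads, then an indirect jump. */
unsigned char case_index = case_index_table[opcode];  /* first load             */
void *target = jump_target_table[case_index];         /* second, dependent load */
goto *target;                                          /* indirect jump          */
```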
With a Release build from main, the switch(op) code looks pretty much the same. With a Debug build a bit of clarity is obtained, but not much -- now we see this:
This confirms my hunch that the |
Thanks for digging into it! |
So the dispatch is effectively (using gcc syntax): If we were to drop the
then would the compiler drop the @gvanrossum I wonder whether If it turns that |
One thing that is not clear. |
See Guido's comment above:
It looks like it follows the C code pretty literally, jumping up to |
Half-baked idea: see how the different compilers handle a hack-y form of computed goto, where a switch over the opcode labels is stuffed into the `DISPATCH_GOTO()` macro:

```c
#define DISPATCH_GOTO() \
    switch (opcode) { \
        case NOP: goto TARGET_NOP; \
        case RESUME: goto TARGET_RESUME; \
        ... \
    }
```

I'm not sure I like it, though, since small changes (in the code or the compiler) could easily result in code size explosion. |
I added dummy cases for all non-existent opcodes, but sadly this did not convince the compiler to skip the indirection table. The switch code looks the same as before; the only difference is that there are a few other instructions between the first and the second memory load:
I don't think this is worth the hassle. |
This worries me because we'd have 150+ copies of that switch. At this point I don't think it's very useful to try and argue with either the compiler, PGO, LTO, or the CPU -- the one area where I think we have some leverage is some common macros that fail to be inlined (IS_TYPE, DECREF, and load_atomic_relaxed[int32]). For that I think we need python/cpython#32387. |
(So I tried out the dummy cases but it didn't get rid of the offset_table, it just moved the memory access up a bit. See previous comment.)
Here
That's just a constant load, and while I suspect (since this is in a DLL) it's actually IP-relative, the CPU probably has a dedicated adder to compute that. So my guess is that the compiler writers know what they are doing, using 32-bit offsets. But they still want to avoid dummy entries there, hence the offset_table. Hmm, maybe we can avoid that by making all the cases actually do something different. |
Nope, it still uses the offset_table. I'm giving up on this. Sorry to get your hopes up during our meeting earlier today. |
Could you publish the branches with your code and the resulting assembler output, so that others can pick up where you left off? |
I will try to create branches later this week. |
Looking at this again I realized that
|
Well I'll be darned. Now the code effectively boils down to this, where initially
(I left out some interleaved instructions that don't touch any of these and seem to be related to flushing registers to the stack or moving them around.) Crucially, there's no trace of the following pattern, which I did find in last week's investigations:
(which was the "compact_table" from above). Thus the new code is now equivalent to
which is one memory load less than before (where opcode was an int). Let me see if we actually need the dummy cases. |
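In the same pseudocode as the earlier sketch (names invented), the improved dispatch is roughly:

```c
#include <stdint.h>

/* With opcode narrowed to uint8_t and all 256 case values present in the
   switch, the compiler can index the jump-target table by the opcode
   itself, so the intermediate index table and its extra load disappear. */
extern void *const jump_target_table[256];

/* assume: uint8_t opcode = ...;  (the fetched opcode) */
goto *jump_target_table[opcode];   /* single memory load, then indirect jump */
```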
Yeah, with just a
Next I'll investigate whether the dummy cases need to contain any code. |
Fortunately, the dummy cases don't need to do anything. I will prepare a separate PR for this. |
Apparently a switch on an 8-bit quantity where all cases are present generates a more efficient jump (doing only one indexed memory load instead of two). See faster-cpython/ideas#321 (comment)
Here's the PR. (Code is included in comments above.) python/cpython#91718 |
This approach makes |
Apparently a switch on an 8-bit quantity where all cases are present generates a more efficient jump (doing only one indexed memory load instead of two). So we make opcode and use_tracing uint8_t, and generate a macro full of extra `case NNN:` lines for all unused opcodes. See faster-cpython/ideas#321 (comment)
|
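A minimal sketch of what such a generated macro can look like; the opcode numbers here are placeholders and the real macro is generated from CPython's opcode definitions:

```c
#include <stdint.h>

/* Placeholder values: cover every opcode that has no real handler so that
   a switch on a uint8_t opcode is fully populated (0..255). */
#define EXTRA_CASES \
    case 240:       \
    case 241:       \
    /* ... one line per unused opcode value ... */ \
    case 255:

static int
dispatch_sketch(uint8_t opcode)
{
    switch (opcode) {
    case 0:                   /* ... the real opcode cases ... */
        return 0;
    EXTRA_CASES               /* unused opcodes need no code of their own */
        return -1;            /* shared "unknown opcode" path */
    }
    return 0;
}
```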
I've done the best measurements I can on my corporate laptop, and I conclude that on Windows 3.11 is now ~20% faster than 3.10. This is main at revision 2f233fceae. Things I did to stabilize the benchmarks:
"I close the issue." |
PS. I now also benchmarked 3.11 just before my first Windows perf commit (i.e., just before GH-91719). That one was only 7% faster than 3.10.4, and current head is 13% faster than that. So one or both of these did have a big effect! |
I ran one more set of benchmarks, and conclude that the switch improvement (GH-91719) contributed 3% and the macro-fied inline functions (GH-89279) contributed the remaining 10%. This is assuming none of the intervening commits contributed performance. The list of commits including the endpoints:
|
There was a report that on Windows the performance on the PyPerformance benchmark is only 3% better than for 3.10. We should try to confirm or refute this result, and if confirmed, attempt to improve the situation.
Likely culprit is the optimizer (or PGO or LTO) giving up on inlining in the main function in ceval.c. In particular it might be the INCREF/DECREF inline functions. A possible hack would be to turn these back into macros again at the top of ceval.c (if no debugging options are defined).
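A sketch of the kind of hack meant here: redefine the refcount operations as plain macros at the top of ceval.c so the optimizer no longer has to inline the static inline functions. This follows the shape of the classic pre-3.10 macros and is an illustration, not the exact patch:

```c
/* Near the top of ceval.c, after the usual includes (sketch only). */
#ifndef Py_DEBUG
#  undef Py_INCREF
#  define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)
#  undef Py_DECREF
#  define Py_DECREF(op)                              \
       do {                                          \
           PyObject *_py_tmp = (PyObject *)(op);     \
           if (--_py_tmp->ob_refcnt == 0) {          \
               _Py_Dealloc(_py_tmp);                 \
           }                                         \
       } while (0)
#endif
```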