
Quick test to see if removing the tier2 interpreter speeds up tier1 #117908


Closed · wants to merge 9 commits

Conversation

@gvanrossum (Member) commented Apr 15, 2024

Because this is just a quick test, I used a hack to define the crucial preprocessor flag, _Py_NOTIER2, by hard-coding it in Makefile.pre.in. If this shows enough of an improvement for Tier 1 that we'd like to have this option, that hack should be replaced with a new configure option. (Adding configure options looks like black magic to me, and IIRC regenerating configure requires a Docker image, so I'm putting that off until after we've seen benchmark results.)
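(For readers following along, a minimal sketch of the mechanism, assuming the hack simply injects the macro through the compiler flags; the actual diff may differ.)

```c
/* Sketch only: assume Makefile.pre.in hard-codes something like
 *     CFLAGS += -D_Py_NOTIER2
 * so that C sources can exclude the Tier 2 machinery at build time. */
#ifdef _Py_NOTIER2
  /* Tier 2 interpreter and optimizer are compiled out. */
#else
  /* Tier 2 is built as usual. */
#endif
```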

Because of that hack, this PR fails all the JIT tests. I'll just be ignoring that.

There's also a weird test failure in the WASI build, which I'll also ignore unless we decide to move ahead with this.

@gvanrossum (Member Author)

The WASI failure in test_trashcan_16602 is suspicious because this PR changes the stack frame size of _PyEval_EvalFrameDefault, and the trashcan tests are sensitive to how deeply the C stack can recurse.

@gvanrossum (Member Author) commented Apr 15, 2024

Benchmark results are in:

  • 1% faster on Linux (but 1% more memory, concentrated mostly in sqlglot and deltablue)
  • 2% slower on Windows

Talked it over with @brandtbucher. We found:

  • On Windows we can probably remove the #pragma optimize("t", off) line (when this flag is defined).
  • There's one use of CALL_STAT_INC in ceval.c (in _PyEvalFramePushAndInit) that was previously an accidental no-op, because that macro is redefined for the Tier 2 interpreter. The fix is to move that function above the redefinition (and to #undef those macros after the Tier 2 interpreter). Both changes are sketched below.
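For anyone unfamiliar with this corner of ceval.c, here's an illustrative sketch of both changes -- my reconstruction, not the PR's actual diff:

```c
/* 1. Only suppress MSVC's speed optimizations when the Tier 2
 *    interpreter is actually part of the build. */
#if defined(_MSC_VER) && !defined(_Py_NOTIER2)
#  pragma optimize("t", off)
#endif

/* 2. The macro-shadowing bug.  CALL_STAT_INC normally expands to a
 *    real counter update (approximately STAT_INC(CALL, name)): */
#define CALL_STAT_INC(name) STAT_INC(CALL, name)

/* ... the Tier 1 interpreter uses the real macro here ... */

#undef CALL_STAT_INC
#define CALL_STAT_INC(name) ((void)0)  /* no-op variant for Tier 2 */

/* ... the Tier 2 interpreter ... */

/* Anything defined below this point -- e.g. _PyEvalFramePushAndInit --
 * silently picks up the no-op variant, so its call stats were never
 * recorded.  The fix: move the function above the redefinition, or
 * restore the real macros once the Tier 2 interpreter is done. */
```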

@gvanrossum (Member Author)

I've fixed the two issues we discovered and am firing off a new benchmark run, only for Win64, without pystats. Since I didn't move the point where the branch forks off main, I'm not sure if this will produce the right results, but it's the thing I'm most curious about. Honestly, the CALL_STAT_INC fix should probably be merged separately.

@gvanrossum (Member Author)

Here are the Windows benchmark results.

It is now only 1% slower than base -- but still, surprisingly, slower. According to the timing plot, both richards and richards_super are 9% slower. What's up with that??? This occurs only on Windows, and it was present in the previous benchmark run too (before I disabled the no-optimization pragma).

@mdboom (Contributor) commented Apr 16, 2024

Those Windows results are puzzling, but we know there is just generally more noise in the Windows results. Maybe this is a fool's errand, but we could try running them again to get a sense of the range of noise in the results?

@gvanrossum (Member Author)

> Those Windows results are puzzling, but we know there is just generally more noise in the Windows results. Maybe this is a fool's errand, but we could try running them again to get a sense of the range of noise in the results?

Hm, both runs showed similar results (in particular, both Richards benchmarks being much slower). Mark's hunch is that the larger function in the base build suppressed some compiler optimization that is actually a pessimization, and the smaller function in this PR now triggers it.

While I'm still curious if there's a smoking gun in Richards, I am leaning towards not merging this PR.

@mdboom (Contributor) commented Apr 16, 2024

> While I'm still curious if there's a smoking gun in Richards, I am leaning towards not merging this PR.

Yes, a bit frustratingly, it's not a clear win.

@brandtbucher (Member)

I personally still think it makes sense to make the tier two interpreter a build-time option (like the JIT) rather than a run-time option. Even if the perf impact is negligible, realistically the only people using tier two are building their own interpreter anyway, and a build-time option simplifies a lot of tricky edge cases when testing tier two (for example, subprocesses not picking up the environment variable or command-line flag).

I'm not sure that we have a good argument for shipping all of tier two when pretty much nobody will be using it.

@gvanrossum (Member Author)

> I'm not sure that we have a good argument for shipping all of tier two when pretty much nobody will be using it.

Agreed. Then maybe I can put more stuff inside #ifndef _Py_NOTIER2 -- e.g. the contents of all the optimizer* files (sketched below).
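For concreteness, per-file guards could look something like this (a sketch, assuming this PR's _Py_NOTIER2 spelling survives):

```c
/* optimizer.c -- illustrative whole-file guard. */
#ifndef _Py_NOTIER2

/* ... trace projection, executor creation, uop optimization passes ... */

#endif  /* !_Py_NOTIER2 */
```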

@gvanrossum (Member Author)

If we were to go ahead with that idea, the configure option is misnamed -- it should probably be something like --enable-slow-tier2-interpreter, the CPP variable should be named _Py_TIER2_INTERPRETER and have the opposite sense, and we might add a separate CPP variable _Py_TIER2 that gets set both by that and by the JIT flag.
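In preprocessor terms, that scheme might look like this (hypothetical names, per the suggestion above; _Py_JIT stands in for whatever macro the JIT build defines):

```c
/* Hypothetical: _Py_TIER2_INTERPRETER would be set by a configure option
 * such as --enable-slow-tier2-interpreter, _Py_JIT by the JIT build.
 * Either one pulls in the shared Tier 2 machinery via a derived macro. */
#if defined(_Py_TIER2_INTERPRETER) || defined(_Py_JIT)
#  define _Py_TIER2
#endif
```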

I'm not sure if I care to do all that work, honestly.

@gvanrossum (Member Author)

I think I'd like to rethink this after gh-116970 (moving the Tier 2 interpreter out into its own function). In that case I'd like to skip the optimizer*.c files almost completely, with the possible exception of PyUnstable_Replace_Executor, PyUnstable_SetOptimizer, PyUnstable_GetOptimizer, and PyUnstable_GetExecutor, which should allow 3rd-party JITs to use this mechanism (although those aren't very useful unless we also keep something that triggers optimization without using PEP 523).
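A sketch of how that split could keep the public hooks alive while compiling out the rest -- a hypothetical layout, using the function names listed above:

```c
/* Hypothetical split of optimizer.c under this plan. */
#ifndef _Py_NOTIER2
/* Full machinery: trace projection, uop passes, executor management. */
#else
/* Keep only PyUnstable_SetOptimizer / PyUnstable_GetOptimizer /
 * PyUnstable_GetExecutor / PyUnstable_Replace_Executor, reduced to the
 * bare registration bookkeeping, so a third-party JIT can still hook
 * in via the PyUnstable API rather than PEP 523. */
#endif
```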
