
Quick test to see if removing the tier2 interpreter speeds up tier1 #117908


Closed · wants to merge 9 commits

Conversation

@gvanrossum (Member) commented Apr 15, 2024

Because this is just a quick test, I used a hack to define the crucial preprocessor flag, _Py_NOTIER2, by hard-coding it in Makefile.pre.in. If this shows enough of an improvement for Tier 1 that we'd like to have this option, that hack should be replaced with a new configure option. (Adding configure options looks like black magic to me, and IIRC regenerating configure requires a Docker image, so I'm putting that off until after we've seen benchmark results.)
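(For readers following along, a minimal sketch of the mechanism, assuming the hack simply injects the macro through the compiler flags; the actual diff may differ.)

```c
/* Sketch only: assume Makefile.pre.in hard-codes something like
 *     CFLAGS += -D_Py_NOTIER2
 * so that C sources can exclude the Tier 2 machinery at build time. */
#ifdef _Py_NOTIER2
  /* Tier 2 interpreter and optimizer are compiled out. */
#else
  /* Tier 2 is built as usual. */
#endif
```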

Because of that hack, this PR fails all the JIT tests. I'll just be ignoring that.

There's also a weird test failure in the WASI build, which I'll also ignore unless we decide to move ahead with this.

@gvanrossum (Member Author)

The WASI failure in test_trashcan_16602 is suspicious because this PR changes the stack frame size of _PyEval_EvalFrameDefault, and the trashcan tests are sensitive to how deeply the C stack can recurse.

@gvanrossum (Member Author) commented Apr 15, 2024

Benchmark results are in:

  • 1% faster on Linux (but 1% more memory, concentrated mostly in sqlglot and deltablue)
  • 2% slower on Windows

Talked it over with @brandtbucher. We found:

  • On Windows we can probably remove the #pragma optimize("t", off) line (when this flag is defined).
  • There's one use of CALL_STAT_INC in ceval.c (in _PyEvalFramePushAndInit) that was previously an accidental no-op, because that macro is redefined for the Tier 2 interpreter. The fix is to move that function above the redefinition (and to #undef those macros after the Tier 2 interpreter). Both changes are sketched below.
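For anyone unfamiliar with this corner of ceval.c, here's an illustrative sketch of both changes -- my reconstruction, not the PR's actual diff:

```c
/* 1. Only suppress MSVC's speed optimizations when the Tier 2
 *    interpreter is actually part of the build. */
#if defined(_MSC_VER) && !defined(_Py_NOTIER2)
#  pragma optimize("t", off)
#endif

/* 2. The macro-shadowing bug.  CALL_STAT_INC normally expands to a
 *    real counter update (approximately STAT_INC(CALL, name)): */
#define CALL_STAT_INC(name) STAT_INC(CALL, name)

/* ... the Tier 1 interpreter uses the real macro here ... */

#undef CALL_STAT_INC
#define CALL_STAT_INC(name) ((void)0)  /* no-op variant for Tier 2 */

/* ... the Tier 2 interpreter ... */

/* Anything defined below this point -- e.g. _PyEvalFramePushAndInit --
 * silently picks up the no-op variant, so its call stats were never
 * recorded.  The fix: move the function above the redefinition, or
 * restore the real macros once the Tier 2 interpreter is done. */
```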

@gvanrossum (Member Author)

I've fixed the two issues we discovered and am firing off a new benchmark run, only for Win64, without pystats. Since I didn't move the point where the branch forks off main, I'm not sure if this will produce the right results, but it's the thing I'm most curious about. Honestly, the CALL_STAT_INC fix should probably be merged separately.

@gvanrossum (Member Author)

Here are the Windows benchmark results.

It is now only 1% slower than base -- but still, surprisingly, slower. According to the timing plot, both richards and richards_super are 9% slower. What's up with that??? This occurs only on Windows, and it was present in the previous benchmark run too (before I disabled the no-optimization pragma).

@mdboom (Contributor) commented Apr 16, 2024

Those Windows results are puzzling, but we know there is just generally more noise in the Windows results. Maybe this is a fool's errand, but we could try running them again to get a sense of the range of noise in the results?

@gvanrossum (Member Author)

> Those Windows results are puzzling, but we know there is just generally more noise in the Windows results. Maybe this is a fool's errand, but we could try running them again to get a sense of the range of noise in the results?

Hm, both runs showed similar results (in particular, both Richards benchmarks being much slower). Mark's hunch is that the larger function in the base build suppressed some compiler optimization that is actually a pessimization, and the smaller function in this PR now triggers it.

While I'm still curious if there's a smoking gun in Richards, I am leaning towards not merging this PR.

@mdboom (Contributor) commented Apr 16, 2024

> While I'm still curious if there's a smoking gun in Richards, I am leaning towards not merging this PR.

Yes, a bit frustratingly, it's not a clear win.

@brandtbucher (Member)

I personally still think it makes sense to make the tier two interpreter a build-time option (like the JIT) rather than a run-time option. Even if the perf impact is negligible, realistically the only people using tier two are building their own interpreter anyway, and a build-time option simplifies a lot of tricky edge cases when testing tier two (for example, subprocesses not picking up the environment variable or command-line flag).

I'm not sure that we have a good argument for shipping all of tier two when pretty much nobody will be using it.

@gvanrossum (Member Author)

> I'm not sure that we have a good argument for shipping all of tier two when pretty much nobody will be using it.

Agreed. Then maybe I can put more stuff inside #ifndef _Py_NOTIER2 -- e.g. the contents of all the optimizer* files (sketched below).
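For concreteness, per-file guards could look something like this (a sketch, assuming this PR's _Py_NOTIER2 spelling survives):

```c
/* optimizer.c -- illustrative whole-file guard. */
#ifndef _Py_NOTIER2

/* ... trace projection, executor creation, uop optimization passes ... */

#endif  /* !_Py_NOTIER2 */
```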

@gvanrossum (Member Author)

If we were to go ahead with that idea, the configure option is misnamed -- it should probably be something like --enable-slow-tier2-interpreter, the CPP variable should be named _Py_TIER2_INTERPRETER and have the opposite sense, and we might add a separate CPP variable _Py_TIER2 that gets set both by that and by the JIT flag.
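In preprocessor terms, that scheme might look like this (hypothetical names, per the suggestion above; _Py_JIT stands in for whatever macro the JIT build defines):

```c
/* Hypothetical: _Py_TIER2_INTERPRETER would be set by a configure option
 * such as --enable-slow-tier2-interpreter, _Py_JIT by the JIT build.
 * Either one pulls in the shared Tier 2 machinery via a derived macro. */
#if defined(_Py_TIER2_INTERPRETER) || defined(_Py_JIT)
#  define _Py_TIER2
#endif
```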

I'm not sure if I care to do all that work, honestly.

@gvanrossum (Member Author)

I think I'd like to rethink this after gh-116970 (moving the Tier 2 interpreter out into its own function). In that case I'd like to skip the optimizer*.c files almost completely, with the possible exception of PyUnstable_Replace_Executor, PyUnstable_SetOptimizer, PyUnstable_GetOptimizer, and PyUnstable_GetExecutor, which should allow 3rd-party JITs to use this mechanism (although those aren't very useful unless we also keep something that triggers optimization without using PEP 523).
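A sketch of how that split could keep the public hooks alive while compiling out the rest -- a hypothetical layout, using the function names listed above:

```c
/* Hypothetical split of optimizer.c under this plan. */
#ifndef _Py_NOTIER2
/* Full machinery: trace projection, uop passes, executor management. */
#else
/* Keep only PyUnstable_SetOptimizer / PyUnstable_GetOptimizer /
 * PyUnstable_GetExecutor / PyUnstable_Replace_Executor, reduced to the
 * bare registration bookkeeping, so a third-party JIT can still hook
 * in via the PyUnstable API rather than PEP 523. */
#endif
```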
