JIT: improve memory allocation #119730

Open
diegorusso opened this issue May 29, 2024 · 6 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-JIT, type-feature (A feature request or enhancement)

Comments

@diegorusso
Contributor

diegorusso commented May 29, 2024

Feature or enhancement

Proposal:

Issue #116017 already explains the problem with the memory allocation used by the JIT.

To gather more data points, I decided to debug this a little further: I added some debugging output to _PyJIT_Compile and then did a pyperformance run.
The debugging output covers the memory allocated and the padding used to align it to the page size.
The function was called 1,288,249 times, and this is the ratio between the actual memory allocated and the padding due to the 16 KiB (on macOS) page size:

  • Total padding size: 16,490,764,792 bytes
  • Total code/data size: 6,737,241,608 bytes

71% of the allocated memory is wasted on padding, while only 29% holds actual code and data. This indicates that the memory needed for these objects is usually much smaller than the page size.

This is a brain dump from @brandtbucher to help out with the implementation:

for 3.14 we'll probably need to look into some sort of slab allocator that will let us share pages between executors. We can allocate by either batching the compiles or stopping the world to flip the permission bits, and then deallocate by maintaining refcounts of each page or something. [...]
One benefit that could come with an arena allocator is the ability to JIT a bunch of guaranteed-in-range trampolines for long jumps to library/C-API calls, rather than needing to create a ton of redundant trampolines inline in the trace (or using global offset table hacks). That should save us memory and speed things up, I think.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed with Brandt via email and in person at PyCon 2024.

@diegorusso diegorusso added the type-feature A feature request or enhancement label May 29, 2024
@diegorusso diegorusso changed the title JIT: Improve memory allocation JIT: improve memory allocation May 29, 2024
@terryjreedy
Member

This seems to be the Discourse discussion
https://discuss.python.org/t/jit-mapping-bytecode-instructions-and-assembly/50809

@mdboom mdboom added the performance Performance or resource usage label May 29, 2024
@diegorusso
Contributor Author

diegorusso commented May 29, 2024

This seems to be the Discourse discussion https://discuss.python.org/t/jit-mapping-bytecode-instructions-and-assembly/50809

@terryjreedy that Discourse discussion is more about issue #118467: @tonybaloney did an initial implementation to dump the JIT code of an executor, and that discussion proposes dumping the JIT code associated with micro-ops.

This issue instead targets how the JIT allocates memory at runtime. At the moment every object is allocated on its own page, so there is a lot of padding due to page alignment.

@brandtbucher brandtbucher added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.14 bugs and security fixes labels May 29, 2024
@brandtbucher
Member

brandtbucher commented May 29, 2024

Thanks for this great summary and issue! Yeah, I think this can progress in a few stages:

  • Carve out pages from a single large slab of memory (for free-threading-safety reasons we'll probably want to future-proof this by giving each thread its own slab), but still keep executors on their own pages. The executors free their own pages when they are deallocated, as they do now (this can happen safely in any thread, not just the allocating thread).
  • Rather than freeing the pages, we may want to reuse them. Could be worth exploring once the allocator exists.
  • Then, start using the beginning of these slabs for things like trampolines, to reduce duplication amongst traces. We'll probably want some sort of (thread-safe!) refcount on the whole slab to keep from leaking this memory if every trace on the page is freed.
  • Finally, the hard part: put several executors on the same page. I think batching the compiles makes the most sense, since it means we don't have to stop the world for every compilation, but there might be other schemes that make sense. Then we need to maintain (thread-safe!) refcounts for each page, to make sure the memory is reclaimed once all traces on a page die.

I can get the ball rolling on step one, and then we can iterate from there.

@diegorusso
Contributor Author

Hello, thanks for laying out an implementation plan. I was discussing this with a colleague and he raised a couple of observations about the last point.
An alternative to batching compiles might be to take advantage of hardware features on recent Intel and Apple CPUs that allow multiple threads to have different permissions for the same page: Intel has memory protection keys (MPK), and Apple has pthread_jit_write_protect_np.
With this approach, jit.c would set its thread's permissions for that page range to RW before emitting the code and toggle them back to RX afterwards; this wouldn't affect another thread that might be concurrently executing another trace on that page. It also avoids the overhead of calling mprotect for each compile, which can be significant when there are many running threads.
For systems without these hardware features, we could either fall back to allocating JIT memory at page granularity, or perhaps multi-map the JIT pages with separate RW and RX mappings of the same physical pages; the RW mapping would be unmapped once the JIT has finished writing to that page.

Thoughts?

@brandtbucher
Member

Ah, neat, I didn't know Intel/AMD had hardware protection keys! Sounds like that's a good plan then. I agree that falling back to one trace per page on other platforms makes the most sense.

@mdboom
Contributor

mdboom commented Jun 3, 2024

take advantage of hardware features on recent Intel and Apple CPUs that allow multiple threads to have different permissions for the same page.

How "recent" are we talking? We should be aware of the additional cost of two behaviors / code paths for this, especially in terms of testing.

@picnixz picnixz removed the 3.14 bugs and security fixes label Dec 15, 2024
No branches or pull requests

6 participants