Skip to content

Conversation

swolchok
Copy link
Contributor

@swolchok swolchok commented Sep 17, 2025

Description

nanobind has a similar two-level lookup strategy, added and explained by
wjakob/nanobind@b515b1f . In brief, fast_type_map should be faster because std::type_index identity is based on the name, whereas fast_type_map uses pointer equality.

In this PR I've ported this approach to pybind11.

Performance seems to have improved. Using my patched fork of pybind11_benchmark
(https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates, specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I run bench.py and observe the MyInt case. Each time, I do 3 runs and just report all 3.

master, Mac: 75.9, 76.9, 75.3 nsec/loop
this PR, Mac: 73.8, 73.8, 73.6 nsec/loop
master, Linux box: 188, 187, 188 nsec/loop
this PR, Linux box: 164, 165, 164 nsec/loop

Note that the "real" percentage improvement is larger than implied by the above because master does not yet include #5824.

Suggested changelog entry:

  • Improved performance of function dispatch and type casting by porting two-level type info lookup strategy from nanobind.

Comment on lines +181 to +185
// REVIEW: do we need to add a fancy hash for pointers or is the
// possible identity hash function from the standard library (e.g.,
// libstdc++) sufficient?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that the Linux box I used for benchmarking has libstdc++.

…nt for global types

nanobind has a similar two-level lookup strategy, added and explained
by
wjakob/nanobind@b515b1f

In this PR I've ported this approach to pybind11. To avoid an ABI
break, I've kept the fast maps to the `local_internals`. I think this
should be safe because any particular module should see its
`local_internals` reset at least as often as the global `internals`,
and misses in the fast "hint" map for global types fall back to the
global `internals`.

Performance seems to have improved. Using my patched fork of
pybind11_benchmark
(https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates,
specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I
run bench.py and observe the MyInt case. Each time, I do 3 runs and
just report all 3.

master, Mac: 75.9, 76.9, 75.3 nsec/loop
this PR, Mac: 73.8, 73.8, 73.6 nsec/loop
master, Linux box: 188, 187, 188 nsec/loop
this PR, Linux box: 164, 165, 164 nsec/loop

Note that the "real" percentage improvement is larger than implied by the
above because master does not yet include pybind#5824.
py::detail::get_local_internals_pp_manager().unref();

// now we unref the static global singleton internals
py::detail::get_local_internals_pp_manager().unref();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended to unref the non-local internals? looks redundant as written

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, butchered my copy/paste/modify job

auto it = locals.find(tp);
inline detail::type_info *get_local_type_info(const std::type_info &tp,
const local_internals &local_internals) {
const auto &locals = local_internals.registered_types_cpp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not changed in this PR but I think this function needs a with_internals lock too. Multiple threads could run bindings from the same DSO simultaneously. local_internals doesn't have its own lock and is generally protected by the global internals lock.

You could optimize by making the versions with a local_internals parameter assume internals is already locked. Then get_type_info() could query both internals under a single lock instead of acquiring, releasing, and re-acquiring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems clearly out of scope for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, doesn't the GIL preclude this sort of thing when present?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems clearly out of scope for this PR.

I figured since you were reworking these functions anyway, I would point out an easy-looking opportunity to improve their thread-safety while you're here. I agree the PR doesn't make current behavior worse and I wasn't trying enforce a wider scope.

(Also, doesn't the GIL preclude this sort of thing when present?)

Yes, but pybind11 supports free-threading mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_internals doesn't have its own lock and is generally protected by the global internals lock.

function_record_ptr_from_PyObject also accesses local_internals and isn't called under the global internals lock.

local_internals.registered_types_cpp.erase(tinfo->cpptype);
} else {
internals.registered_types_cpp.erase(tindex);
local_internals.global_registered_types_cpp_fast.erase(tinfo->cpptype);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only erase the fast map entry for the DSO where the type was bound. Other DSOs will preserve their cached entries in the fast map, which will now dangle. Solving that with an ABI break is pretty easy - make a linked list of all the fast maps and have modifications to the slow map invalidate the corresponding entries in the fast maps. (Or since you're taking an ABI break, just put the fast map in global internals.) Solving it without an ABI break will be difficult. The best way I can think of is to accompany each entry you add to the fast map with a keep_alive that will remove it when the targeted type object is being destroyed. That's pretty heavyweight though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding the hole in this strategy! I wonder if there's a way to accommodate developing things in anticipation of a future ABI break without just doubling CI costs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without just doubling CI costs.

We used to have a few extra jobs that set PYBIND11_INTERNALS_VERSION=10000000 or similar.

Try git show 662a88cbc16be6ad1698ce73e7a30982119f2fca and look for -DPYBIND11_INTERNALS_VERSION=10000000.

We could do something surgical, like picking out one or a couple jobs each for Linux, macOS, Windows to override the internals version.

Copy link
Contributor

@oremanj oremanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, wjakob/nanobind#1140 incidentally contains some rework of the nanobind fast type map in case you want to incorporate ideas from that. One fast map per thread (to avoid lock contention) + caching negative lookups. Note that caching negative lookups requires also remembering which type_info pointers you've seen for a given C++ type, so you can invalidate the cache when a new type is bound. (AFAICT, nothing you're doing here precludes getting fancier in that direction later.)

@swolchok swolchok changed the title Add fast_type_map, use it authoritatively for local types and as a hint for global types Add fast_type_map, use it authoritatively for local types and as a hint for global types (ABI breaking) Sep 22, 2025
@rwgk
Copy link
Collaborator

rwgk commented Sep 22, 2025

I think this is fine, assuming we also add some extra test coverage.

@b-pass for visibility — This PR needs a git merge master before a final review though.

/// changed to the new version.
#ifndef PYBIND11_INTERNALS_VERSION
// REMINDER for next version bump: remove loader_life_support_tls
# define PYBIND11_INTERNALS_VERSION 11
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make the default version dependent on the Python version again (we did this before). Concretely, if we detect Python >= 3.14, #define PYBIND11_INTERNALS_VERSION 12. That'll add to the test coverage without breaking any existing ABIs.

Copy link
Contributor

@b-pass b-pass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants