Avoid heap allocation for function calls with a small number of args #5824
Conversation
Didn't fully realize pybind11 continues to support C++11 and C++14, hence the failing CI jobs. Can remove std::variant, which I think will also save memory and allow an inline size increase. Would be nice to have directional review feedback if folks see this before I get that done, though.
Awesome.
That'd be great. Could you please move
WDYT about a more generic name, e.g.
You'll have to update the top-level
This was feasible mostly because we can take a bunch of implementation shortcuts knowing that the value type is py::handle, which is trivially copyable and trivially destructible, and knowing that we only have to do a limited subset of the vector interface. I think it would be best to be more specific, not less; I've thought about diverging from the vector interface entirely by requiring that the maximum size be specified up front (mandatory reserve) and not being able to grow in push_back at all, but I haven't done the legwork necessary to make sure that would actually be workable.
Got it, thanks for the explanation.
…ents We don't have access to llvm::SmallVector or similar, but given the limited subset of the `std::vector` API that `function_call::args{,_convert}` need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done. Seems to improve time to call the collatz function in pybind/pybind11_benchmark significantly; numbers are a little noisy but there's a clear improvement from "about 60 ns per call" to "about 45 ns per call" on my machine (M4 Max Mac), as measured with `timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz')`.
Force-pushed from 7c09a6b to 9415686
(Please don't force push, because that makes it more difficult for me to follow along. Just keep merging, that works really well for me.)
Hmm, looks like I managed to somehow break exactly IO redirection on exactly windows-latest but not windows-2022. Breakage includes clang-latest, so it's not an MSVC issue. Nothing jumps out at me as to the cause yet.
Looks like windows-latest might have changed to windows-2025 very recently. Example job from yesterday where windows-latest meant windows-2022: https://github.com/pybind/pybind11/actions/runs/17443220310/job/49540880643 . GitHub does seem to say that the rollout started 2 days ago, so this is plausible: actions/runner-images#12677 I guess we should check whether the windows-latest workflows are broken on master without any changes?
Sent #5825
Some super-quick comments, I can only spend 5 minutes right now. I'll probably get to it over the weekend.
include/pybind11/cast.h (Outdated)
@@ -2045,10 +2046,12 @@ struct function_call {
    const function_record &func;

    /// Arguments passed to the function:
    std::vector<handle> args;
    /// (Inline size chosen mostly arbitrarily; 5 should pad function_call out to two cache lines
6 should?
Maybe make this a constexpr unsigned argument_vector_small_size = 6; or similar, because you're using the constant in (at least) three places?
include/pybind11/pybind11.h (Outdated)
if (overloaded) {
    // We're in the first no-convert pass, so swap out the conversion flags for a
    // set of all-false flags. If the call fails, we'll swap the flags back in for
    // the conversion-allowed call below.
    second_pass_convert.resize(func.nargs, false);
    second_pass_convert = args_convert_vector<6>(func.nargs, false);
Could you add a terse comment to hint that creating a new object is better than some sort of resize semantics?
The reason I chose not to implement resize is that it's more general than we need -- you can resize containers to be smaller or larger and cross (or not cross) the inline size limit in either direction. We can just rewrite this in a way that more or less removes the question, though:
second_pass_convert = std::move(call.args_convert);
call.args_convert = args_convert_vector<argument_vector_small_size>(func.nargs, false);
(call.args_convert is moved-from, so we're not necessarily sure about its state.)
@@ -0,0 +1,84 @@
#include "pybind11/pybind11.h" |
Maybe create a new directory, test_low_level or similar? Putting this in test_embed seems misleading.
@henryiii for opinion
This doesn't qualify for test/pure_cpp because it needs CPython around to compile py::handle, so I put it in test_embed because I was hoping not to have to duplicate the CMake configuration for C++ tests that need CPython around. I'll see if there's another way to do that, like creating a utility CMake file.
what if instead we renamed test_embed?
This doesn't qualify for test/pure_cpp
Agreed/realized.
what if instead we renamed test_embed?
I'd be OK with that.
renamed test_embed
done
🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) / 🧪 It was hanging here:
I've seen this once before, a few days ago. Rerun triggered.
…y clarify second_pass_convert
Small heads-up: I'll get to fully reviewing this only between Thu-Sun. How about
"Low level" doesn't fit too well for the actual embedding tests. I think this accurately reflects that we're grouping tests together for a technical reason (I checked, all three existing
Updated benchmark numbers quoted in the description to reflect latest master. Specific data for the 3 trials before/after was 54.2, 54.4, 54.6 nsec/loop before and 35.4, 35.6, 34.4 nsec/loop after on M4 Max. On a Linux box I have 99.5, 102, 101 nsec/loop before and 73, 72.3, 74.5 nsec/loop after. @rwgk anything I can do to make this easier to review?
I just need to find the time; I can only spend "spare" time. I'll try to get to it asap. Did you see the email message I sent 10:01AM PST today (subject "pybind11 slack workspace")?
Because it's potentially relevant to this review: I started poking around nanobind just now and found that it uses alloca() to accomplish the same goal as this PR. I would not recommend this here; |
(by the way, the CIBW failure looks unrelated? "Warning: An error occurred while preparing SDK package AOSP ATD Intel x86_64 Atom System Image: Error on ZipFile unknown archive.:")
Please note that there are some checks in place for this. For keyword arguments, there is a check:
/* The following lines allocate memory on the stack, which is very efficient
   but also potentially dangerous since it can be used to generate stack
   overflows. We refuse unrealistically large number of 'kwargs' (the
   'max_nargs' value is fine since it is specified by the bindings) */
if (nkwargs_in > 1024) {
    PyErr_SetString(PyExc_TypeError,
                    "nanobind::detail::nb_func_vectorcall(): too many (> "
                    "1024) keyword arguments.");
    return nullptr;
}
Fair enough. The more pertinent reason not to use alloca() is that it would require both more invasive changes to
…nt for global types nanobind has a similar two-level lookup strategy, added and explained by wjakob/nanobind@b515b1f In this PR I've ported this approach to pybind11. To avoid an ABI break, I've kept the fast maps to the `local_internals`. I think this should be safe because any particular module should see its `local_internals` reset at least as often as the global `internals`, and misses in the fast "hint" map for global types fall back to the global `internals`. Performance seems to have improved. Using my patched fork of pybind11_benchmark (https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates, specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I run bench.py and observe the MyInt case. Each time, I do 3 runs and just report all 3. master, Mac: 75.9, 76.9, 75.3 nsec/loop this PR, Mac: 73.8, 73.8, 73.6 nsec/loop master, Linux box: 188, 187, 188 nsec/loop this PR, Linux box: 164, 165, 164 nsec/loop Note that the "real" percentage improvement is larger than implied by the above because master does not yet include pybind#5824.
        vector.~heap_vector();
    }
}
inline_array_or_vector(const inline_array_or_vector &) = delete;
A comment for easier reading?
E.g. // Disable copy ctor and assignment
I generally avoid comments that repeat the code, but if you specifically want one here then sure
Generally I'm totally with you ... just some minimal mercy for sleepy or rushed eyes scanning this code.
inline_array array;
heap_vector vector;
WDYT about iarray and hvector (or iarr, hvec), to more clearly distinguish them from the std types?
// of bytes. See
// https://dev-discuss.pytorch.org/t/unionizing-for-profit-how-to-exploit-the-power-of-unions-in-c/444#the-memcpy-loophole-4
bool result = false;
std::memcpy(&result, reinterpret_cast<const char *>(this), sizeof(bool));
When I got here I had to consult ChatGPT:
https://chatgpt.com/share/68cb906b-941c-8008-9ac2-acbda6c78ca8
ChatGPT is happy with this implementation.
TBH, I'm not the best person to review code manually optimized to this intensity.
If you added the static_asserts suggested by ChatGPT, it'd look safer to me.
@oremanj, is there a chance you could help out with a full review of this PR?
FWIW, the linked article (written by me) tries to explain this technique as well. I recommend reading from the top (and skipping sections if they are boring review to you; it starts slowly) if you have time.
What you have here is surely acceptable, but I can't help feeling uneasy about reaching so deeply into the bag of tricks. So I asked ChatGPT another question in the same conversation, same URL:
https://chatgpt.com/share/68cb906b-941c-8008-9ac2-acbda6c78ca8
It came back with this (copy-pasted here for easy reference):
template <typename ArrayT, std::size_t InlineSize, typename VectorT = ArrayT>
union inline_array_or_vector {
struct tag_view {
bool is_inline;
};
struct inline_array {
bool is_inline; // must be first
std::uint32_t size = 0;
std::array<ArrayT, InlineSize> arr;
};
struct heap_vector {
bool is_inline; // must be first
std::vector<VectorT> vec;
heap_vector() = default;
heap_vector(std::size_t count, VectorT value) : vec(count, value) {}
};
// Union members
tag_view tag; // “discriminator view”
inline_array array;
heap_vector vector;
inline_array_or_vector() : array{true, 0, {}} {}
~inline_array_or_vector() {
if (!is_inline()) {
vector.~heap_vector();
}
}
bool is_inline() const {
// Well-defined: all three arms share the same first member layout,
// so the common-initial-sequence rule lets us read tag.is_inline.
return tag.is_inline;
}
// ... move ctor/assign as in the PR, using array/vector arms ...
};
WDYT?
ChatGPT is wrong. You can't read from an inactive member of a union because it's undefined behavior, full stop. C may allow it, but C++ does not.
(EDIT: see correction below; ChatGPT is only wrong because heap_vector isn't guaranteed standard-layout)
C++: "It is undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union." -- https://en.cppreference.com/w/cpp/language/union.html
C: "If the member used to access the contents of a union is not the same as the member last used to store a value, the object representation of the value that was stored is reinterpreted as an object representation of the new type (this is known as type punning)" -- https://en.cppreference.com/w/c/language/union.html
Hmm, I stand partially corrected. I asked your ChatGPT conversation for a citation about the common initial sequence rule, and sure enough it does exist in C++ (which I did not realize), and it implies that your proposed tag_view change is fine, if inline_array and heap_vector are both standard-layout types. Sadly, heap_vector is not guaranteed to be standard-layout because std::vector is not guaranteed to be standard-layout, as we unfortunately ran into below. Good to know though!
Thanks for the explanation! (I posted my other comment a minute ago before seeing this. I updated that comment with a strikethrough.)
A non-standardese citation for the common initial sequence rule (which, again, I had previously missed) is just "If two union members are standard-layout types, it's well-defined to examine their common subsequence on any compiler." -- https://en.cppreference.com/w/cpp/language/union.html
…st. Add static_asserts for our untagged union implementation per request.
Hmm, MSVC thinks that heap_vector is not a standard-layout type. Time for some research. Looks like /MDd as opposed to /MD is what causes the change: https://godbolt.org/z/axrWb5EM3 /MDd changes to a debugging version of the standard library: https://learn.microsoft.com/en-us/cpp/build/reference/md-mt-ld-use-run-time-library?view=msvc-170 and so it appears that the problem is because std::vector is not standard-layout in the debug library version: https://godbolt.org/z/ecP95nYje I am going to remove the is_standard_layout assertions, because as far as I know there is no reason the contained types have to be standard layout. I am keeping the offsetof assertions because the offsets do in fact have to match. (They don't have to be 0, but they really should be so I'm happy to assert that they are.)
I looked some more, looks good to me.
I didn't look at every single line of the production code, and only glanced through the tests.
@oremanj If you could help out with a second set of eyes in a similar fashion, I'd be comfortable merging this (with or without the tag_view idea)
Hm, mostly for my curiosity, could you please explain why it'd be OK if the offset is not 0?
It seems like
If not, wouldn't it have to be like this (untested)?
I meant that it would be OK because we could make the change you outlined. I'm fairly sure it's guaranteed to actually be zero given the code we've written.
Most recent CI run has a deadlock on only mingw64. Haven't seen this on previous runs, and it seems odd that this change would cause a deadlock on exactly mingw, so I'd like to retry. I don't see a button to do that though; I'll let the one straggler job finish and then I guess just push another rev?
I can help by clicking the rerun button.
Sure enough, the mingw64 deadlock disappeared.
For completeness: I had to cancel the CI / 🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) job; it was hanging. This wasn't the first time. Certainly not related to this PR.
Thanks @oremanj for the review!
Description
We don't have access to llvm::SmallVector or similar, but given the limited subset of the std::vector API that function_call::args{,_convert} need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done.
Seems to improve time to call the collatz function in pybind/pybind11_benchmark significantly; numbers are a little noisy but there's a clear improvement from "about 54 ns per call" to "about 35 ns per call" on my machine (M4 Max Mac), as measured with timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz').
Suggested changelog entry: