Avoid heap allocation for function calls with a small number of args #5824
Conversation
Didn't fully realize pybind11 continues to support C++11 and C++14, hence the failing CI jobs. Can remove std::variant, which I think will also save memory and allow an inline size increase. Would be nice to have directional review feedback if folks see this before I get that done, though.
Awesome.
That'd be great. Could you please move
WDYT about a more generic name, e.g.
You'll have to update the top-level
This was feasible mostly because we can take a bunch of implementation shortcuts knowing that the value type is py::handle, which is trivially copyable and trivially destructible, and knowing that we only have to do a limited subset of the vector interface. I think it would be best to be more specific, not less; I've thought about diverging from the vector interface entirely by requiring that the maximum size be specified up front (mandatory reserve) and not being able to grow in push_back at all, but I haven't done the legwork necessary to make sure that would actually be workable.
Got it, thanks for the explanation.
…ents We don't have access to llvm::SmallVector or similar, but given the limited subset of the `std::vector` API that `function_call::args{,_convert}` need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done. Seems to improve time to call the collatz function in pybind/pybind11_benchmark significantly; numbers are a little noisy but there's a clear improvement from "about 60 ns per call" to "about 45 ns per call" on my machine (M4 Max Mac), as measured with `timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz')`.
Force-pushed from 7c09a6b to 9415686
(Please don't force push, because that makes it more difficult for me to follow along. Just keep merging, that works really well for me.)
Hmm, looks like I managed to somehow break exactly IO redirection on exactly windows-latest but not windows-2022. Breakage includes clang-latest, so it's not an MSVC issue. Nothing jumps out at me as to the cause yet.
Looks like windows-latest might have changed to windows-2025 very recently. Example job from yesterday where windows-latest meant windows-2022: https://github.com/pybind/pybind11/actions/runs/17443220310/job/49540880643 . GitHub does seem to say that the rollout started 2 days ago, so this is plausible: actions/runner-images#12677 I guess we should check whether the windows-latest workflows are broken on master without any changes?
Sent #5825
Some super-quick comments, I can only spend 5 minutes right now. I'll probably get to it over the weekend.
include/pybind11/cast.h (Outdated)
@@ -2045,10 +2046,12 @@ struct function_call {
    const function_record &func;

    /// Arguments passed to the function:
    std::vector<handle> args;
    /// (Inline size chosen mostly arbitrarily; 5 should pad function_call out to two cache lines
6 should?
Maybe make this a constexpr unsigned argument_vector_small_size = 6; or similar, because you're using the constant in (at least) three places?
include/pybind11/pybind11.h (Outdated)
if (overloaded) {
    // We're in the first no-convert pass, so swap out the conversion flags for a
    // set of all-false flags. If the call fails, we'll swap the flags back in for
    // the conversion-allowed call below.
    second_pass_convert.resize(func.nargs, false);
    second_pass_convert = args_convert_vector<6>(func.nargs, false);
Could you add a terse comment to hint that creating a new object is better than some sort of resize semantics?
The reason I chose not to implement resize is that it's more general than we need -- you can resize containers to be smaller or larger and cross (or not cross) the inline size limit in either direction. We can just rewrite this in a way that more or less removes the question, though:
second_pass_convert = std::move(call.args_convert);
call.args_convert = args_convert_vector<argument_vector_small_size>(func.nargs, false);
(call.args_convert is moved-from, so we're not necessarily sure about its state.)
@@ -0,0 +1,84 @@
#include "pybind11/pybind11.h" |
Maybe create a new directory, test_low_level or similar? Putting this in test_embed seems misleading.
@henryiii for opinion
This doesn't qualify for test/pure_cpp because it needs CPython around to compile py::handle, so I put it in test_embed because I was hoping not to have to duplicate the CMake configuration for C++ tests that need CPython around. I'll see if there's another way to do that, like creating a utility CMake file.
what if instead we renamed test_embed?
This doesn't qualify for test/pure_cpp
Agreed/realized.
what if instead we renamed test_embed?
I'd be OK with that.
renamed test_embed
done
🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) / 🧪 It was hanging here:
I've seen this once before, a few days ago. Rerun triggered.
…y clarify second_pass_convert
Small heads-up: I'll get to fully reviewing this only between Thu-Sun. How about
"Low level" doesn't fit too well for the actual embedding tests. I think this accurately reflects that we're grouping tests together for a technical reason (I checked, all three existing
Updated benchmark numbers quoted in the description to reflect latest master. Specific data for the 3 trials before/after was 54.2, 54.4, 54.6 nsec/loop before and 35.4, 35.6, 34.4 nsec/loop after on M4 Max. On a Linux box I have 99.5, 102, 101 nsec/loop before and 73, 72.3, 74.5 nsec/loop after. @rwgk anything I can do to make this easier to review?
I just need to find the time; I can only spend "spare" time. I'll try to get to it asap. Did you see the email message I sent 10:01AM PST today (subject "pybind11 slack workspace")?
Because it's potentially relevant to this review: I started poking around nanobind just now and found that it uses alloca() to accomplish the same goal as this PR. I would not recommend this here; |
(by the way, the CIBW failure looks unrelated? "Warning: An error occurred while preparing SDK package AOSP ATD Intel x86_64 Atom System Image: Error on ZipFile unknown archive.:")
Please note that there are some checks in place for this. For keyword arguments, there is a check:
/* The following lines allocate memory on the stack, which is very efficient
   but also potentially dangerous since it can be used to generate stack
   overflows. We refuse unrealistically large number of 'kwargs' (the
   'max_nargs' value is fine since it is specified by the bindings) */
if (nkwargs_in > 1024) {
    PyErr_SetString(PyExc_TypeError,
                    "nanobind::detail::nb_func_vectorcall(): too many (> "
                    "1024) keyword arguments.");
    return nullptr;
}
Fair enough. The more pertinent reason not to use alloca() is that it would require both more invasive changes to
…nt for global types nanobind has a similar two-level lookup strategy, added and explained by wjakob/nanobind@b515b1f In this PR I've ported this approach to pybind11. To avoid an ABI break, I've kept the fast maps to the `local_internals`. I think this should be safe because any particular module should see its `local_internals` reset at least as often as the global `internals`, and misses in the fast "hint" map for global types fall back to the global `internals`. Performance seems to have improved. Using my patched fork of pybind11_benchmark (https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates, specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I run bench.py and observe the MyInt case. Each time, I do 3 runs and just report all 3. master, Mac: 75.9, 76.9, 75.3 nsec/loop this PR, Mac: 73.8, 73.8, 73.6 nsec/loop master, Linux box: 188, 187, 188 nsec/loop this PR, Linux box: 164, 165, 164 nsec/loop Note that the "real" percentage improvement is larger than implied by the above because master does not yet include pybind#5824.
        vector.~heap_vector();
    }
}
inline_array_or_vector(const inline_array_or_vector &) = delete;
A comment for easier reading?
E.g. // Disable copy ctor and assignment
I generally avoid comments that repeat the code, but if you specifically want one here then sure
Generally I'm totally with you ... just some minimal mercy for sleepy or rushed eyes scanning this code.
inline_array array;
heap_vector vector;
WDYT about iarray and hvector (or iarr, hvec), to more clearly distinguish them from the std types?
// of bytes. See
// https://dev-discuss.pytorch.org/t/unionizing-for-profit-how-to-exploit-the-power-of-unions-in-c/444#the-memcpy-loophole-4
bool result = false;
std::memcpy(&result, reinterpret_cast<const char *>(this), sizeof(bool));
When I got here I had to consult ChatGPT:
https://chatgpt.com/share/68cb906b-941c-8008-9ac2-acbda6c78ca8
ChatGPT is happy with this implementation.
TBH, I'm not the best person to review code manually optimized to this intensity.
If you added the static_asserts suggested by ChatGPT, it'd look safer to me.
@oremanj, is there a chance you could help out with a full review of this PR?
FWIW, the linked article (written by me) tries to explain this technique as well. I recommend reading from the top (and skipping sections if they are boring review to you; it starts slowly) if you have time.
What you have here is surely acceptable, but I can't help feeling uneasy about reaching so deeply into the bag of tricks. So I asked ChatGPT another question in the same conversation, same URL:
https://chatgpt.com/share/68cb906b-941c-8008-9ac2-acbda6c78ca8
It came back with this (copy-pasted here for easy reference):
template <typename ArrayT, std::size_t InlineSize, typename VectorT = ArrayT>
union inline_array_or_vector {
struct tag_view {
bool is_inline;
};
struct inline_array {
bool is_inline; // must be first
std::uint32_t size = 0;
std::array<ArrayT, InlineSize> arr;
};
struct heap_vector {
bool is_inline; // must be first
std::vector<VectorT> vec;
heap_vector() = default;
heap_vector(std::size_t count, VectorT value) : vec(count, value) {}
};
// Union members
tag_view tag; // “discriminator view”
inline_array array;
heap_vector vector;
inline_array_or_vector() : array{true, 0, {}} {}
~inline_array_or_vector() {
if (!is_inline()) {
vector.~heap_vector();
}
}
bool is_inline() const {
// Well-defined: all three arms share the same first member layout,
// so the common-initial-sequence rule lets us read tag.is_inline.
return tag.is_inline;
}
// ... move ctor/assign as in the PR, using array/vector arms ...
};
WDYT?
ChatGPT is wrong. You can't read from an inactive member of a union because it's undefined behavior, full stop. C may allow it, but C++ does not.
(EDIT: see correction below; ChatGPT is only wrong because heap_vector isn't guaranteed standard-layout)
C++: "It is undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union." -- https://en.cppreference.com/w/cpp/language/union.html
C: "If the member used to access the contents of a union is not the same as the member last used to store a value, the object representation of the value that was stored is reinterpreted as an object representation of the new type (this is known as type punning)" -- https://en.cppreference.com/w/c/language/union.html
Hmm, I stand partially corrected. I asked your ChatGPT conversation for a citation about the common initial sequence rule, and sure enough it does exist in C++ (which I did not realize), and it implies that your proposed tag_view change is fine, if inline_array and heap_vector are both standard-layout types. Sadly, heap_vector is not guaranteed to be standard-layout because std::vector is not guaranteed to be standard-layout, as we unfortunately ran into below. Good to know though!
Thanks for the explanation! (I posted my other comment a minute ago before seeing this. I updated that comment with a strikethrough.)
A non-standardese citation for the common initial sequence rule (which, again, I had previously missed) is just "If two union members are standard-layout types, it's well-defined to examine their common subsequence on any compiler." -- https://en.cppreference.com/w/cpp/language/union.html
…st. Add static_asserts for our untagged union implementation per request.
Hmm, MSVC thinks that heap_vector is not a standard-layout type. Time for some research. Looks like /MDd as opposed to /MD is what causes the change: https://godbolt.org/z/axrWb5EM3 /MDd changes to a debugging version of the standard library: https://learn.microsoft.com/en-us/cpp/build/reference/md-mt-ld-use-run-time-library?view=msvc-170 and so it appears that the problem is because std::vector is not standard-layout in the debug library version: https://godbolt.org/z/ecP95nYje I am going to remove the is_standard_layout assertions, because as far as I know there is no reason the contained types have to be standard layout. I am keeping the offsetof assertions because the offsets do in fact have to match. (They don't have to be 0, but they really should be so I'm happy to assert that they are.)
I looked some more, looks good to me.
I didn't look at every single line of the production code, and only glanced through the tests.
@oremanj If you could help out with a second set of eyes in a similar fashion, I'd be comfortable merging this (with or without the tag_view idea)
Hm, mostly for my curiosity, could you please explain why it'd be OK if the offset is not 0?
It seems like
If not, wouldn't it have to be like this (untested)?
I meant that it would be OK because we could make the change you outlined. I'm fairly sure it's guaranteed to actually be zero given the code we've written.
Most recent CI run has a deadlock on only mingw64. Haven't seen this on previous runs, and it seems odd that this change would cause a deadlock on exactly mingw, so I'd like to retry. I don't see a button to do that though; I'll let the one straggler job finish and then I guess just push another rev?
I can help by clicking the rerun button.
Sure enough, the mingw64 deadlock disappeared.
For completeness: I had to cancel the CI / 🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) job; it was hanging. This wasn't the first time. Certainly not related to this PR.
Thanks @oremanj for the review!
Description
We don't have access to llvm::SmallVector or similar, but given the limited subset of the std::vector API that function_call::args{,_convert} need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done.
Seems to improve time to call the collatz function in pybind/pybind11_benchmark significantly; numbers are a little noisy but there's a clear improvement from "about 54 ns per call" to "about 35 ns per call" on my machine (M4 Max Mac), as measured with timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz').
Suggested changelog entry: