Skip to content

Conversation

swolchok
Copy link
Contributor

@swolchok swolchok commented Sep 4, 2025

Description

We don't have access to llvm::SmallVector or similar, but given the limited subset of the std::vector API that
function_call::args{,_convert} need and the "reserve-then-fill" usage pattern, it is relatively straightforward to implement custom containers that get the job done.

Seems to improves time to call the collatz function in pybind/pybind11_benchmark significantly; numbers are a little noisy but there's a clear improvement from "about 54 ns per call" to "about 35 ns per call" on my machine (M4 Max Mac), as measured with timeit.repeat('collatz(4)', 'from pybind11_benchmark import collatz').

Suggested changelog entry:

  • Dispatching functions with 6 or fewer arguments no longer requires a heap allocation for their C++ argument array.

@swolchok swolchok requested a review from henryiii as a code owner September 4, 2025 05:38
@swolchok
Copy link
Contributor Author

swolchok commented Sep 4, 2025

Didn't fully realize pybind11 continues to support C++11 and C++14, hence failing CI jobs. Can remove std::variant, which I think will also save memory and allow inline size increase. Would be nice to have directional review feedback if folks see this before I get that done, though.

@rwgk
Copy link
Collaborator

rwgk commented Sep 4, 2025

Awesome.

Can remove std::variant, which I think will also save memory and allow inline size increase.

That'd be great.

Could you please move struct argument_vector to a separate file under include/pybind11/detail?

WDYT about a more generic name, e.g. hybrid_vector? I think that'd fit in nicely between SmallVector and vector.

You'll have to update the top-level CMakeLists.txt and tests/extra_python_package/test_files.py for the new file in include/pybind11/detail. Look for e.g. dynamic_raw_ptr_cast_if_possible.h to pin-point where to add the new filename.

@swolchok
Copy link
Contributor Author

swolchok commented Sep 4, 2025

more generic name

This was feasible mostly because we can take a bunch of implementation shortcuts knowing that the value type is py::handle, which is trivially copyable and trivially destructible, and knowing that we only have to do a limited subset of the vector interface. I think it would be best to be more specific, not less; I've thought about diverging from the vector interface entirely by requiring that the maximum size be specified up front (mandatory reserve) and not being able to grow in push_back at all, but I haven't done the legwork necessary to make sure that would actually be workable.

@rwgk
Copy link
Collaborator

rwgk commented Sep 4, 2025 via email

…ents

We don't have access to llvm::SmallVector or similar, but given the
limited subset of the `std::vector` API that
`function_call::args{,_convert}` need and the "reserve-then-fill"
usage pattern, it is relatively straightforward to implement custom
containers that get the job done.

Seems to improves time to call the collatz function in
pybind/pybind11_benchmark significantly; numbers are a little noisy
but there's a clear improvement from "about 60 ns per call" to "about
45 ns per call" on my machine (M4 Max Mac), as measured with
`timeit.repeat('collatz(4)', 'from pybind11_benchmark import
collatz')`.
@swolchok swolchok force-pushed the argument-small-vectors branch from 7c09a6b to 9415686 Compare September 4, 2025 19:49
@rwgk
Copy link
Collaborator

rwgk commented Sep 4, 2025

(Please don't force push, because that makes it more difficult for me to follow along. Just keep merging, that works really well for me.)

@swolchok
Copy link
Contributor Author

swolchok commented Sep 4, 2025

Hmm, looks like I managed to somehow break exactly IO redirection on exactly windows-latest but not windows-2022. Breakage includes clang-latest, so it's not an MSVC issue. Nothing jumps out at me as to the cause yet.

@swolchok
Copy link
Contributor Author

swolchok commented Sep 4, 2025

Looks like windows-latest might have changed to windows-2025 very recently. Example job from yesterday where windows-latest meant windows-2022: https://github.com/pybind/pybind11/actions/runs/17443220310/job/49540880643 . GitHub does seem to say that the rollout started 2 days ago, so this is plausible: actions/runner-images#12677

I guess we should check whether the windows-latest workflows are broken on master without any changes?

@swolchok
Copy link
Contributor Author

swolchok commented Sep 4, 2025

check whether the windows-latest workflows are broken on master without any changes?

Sent #5825

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some super-quick comments, I can only spend 5 minutes right now. I'll probably get to it over the weekend.

@@ -2045,10 +2046,12 @@ struct function_call {
const function_record &func;

/// Arguments passed to the function:
std::vector<handle> args;
/// (Inline size chosen mostly arbitrarily; 5 should pad function_call out to two cache lines
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 should?

Maybe make this a constexpr unsigned argument_vector_small_size = 6; or similar, because you're using the constant in three (at least) places?

if (overloaded) {
// We're in the first no-convert pass, so swap out the conversion flags for a
// set of all-false flags. If the call fails, we'll swap the flags back in for
// the conversion-allowed call below.
second_pass_convert.resize(func.nargs, false);
second_pass_convert = args_convert_vector<6>(func.nargs, false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a terse comment to hint that creating a new object is better than some sort of resize semantics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason I chose not to implement resize is that it's more general than we need -- you can resize containers to be smaller or larger and cross (or not cross) the inline size limit in either direction. we can just rewrite this in a way that more or less removes the question, though:

second_pass_convert = std::move(call.args_convert);
call.args_convert = args_convert_vector<argument_vector_small_size>(func.nargs, false);

(call.args_convert is moved-from, so we're not necessarily sure about its state.)

@@ -0,0 +1,84 @@
#include "pybind11/pybind11.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe create a new directory, test_low_level or similar?

Putting this in test_embed seems misleading.

@henryiii for opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't qualify for test/pure_cpp because it needs CPython around to compile py::handle, so I put it in test_embed because I was hoping not to have to duplicate the CMake configuration for C++ tests that need CPython around. I'll see if there's another way to do that, like creating a utility CMake file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if instead we renamed test_embed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't qualify for test/pure_cpp
Agreed/realized.

what if instead we renamed test_embed?
I'd be OK with that.

@henryiii?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed test_embed

done

@rwgk
Copy link
Collaborator

rwgk commented Sep 5, 2025

🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) / 🧪
cancelled now in 5h 55m 47s

It was hanging here:

Run cmake --build build --target cpptest
Change Dir: '/Users/runner/work/pybind11/pybind11/build'
Run Build Command(s): /opt/homebrew/bin/ninja -v cpptest
[0/2] /opt/homebrew/bin/cmake -P /Users/runner/work/pybind11/pybind11/build/CMakeFiles/VerifyGlobs.cmake

I've seen this once before, a few days ago.

Rerun triggered.

@rwgk
Copy link
Collaborator

rwgk commented Sep 8, 2025

Small heads-up: I'll get to fully reviewing this only between Thu-Sun.

How about test_with_catch as the new directory name?

"Low level" doesn't fit too well for the actual embedding tests.

I think this accurately reflects that we're grouping tests together for a technical reason (I checked, all three existing .cpp files include catch.hpp).

@swolchok
Copy link
Contributor Author

Updated benchmark numbers quoted in the description to reflect latest master. Specific data for the 3 trials before/after was 54.2, 54.4, 54.6 nsec/loop before and 35.4, 35.6, 34.4 nsec/loop after on M4 Max. On a Linux box I have 99.5, 102, 101 nsec/loop before and 73, 72.3, 74.5 nsec/loop after.

@rwgk anything I can do to make this easier to review?

@rwgk
Copy link
Collaborator

rwgk commented Sep 15, 2025

Updated benchmark numbers quoted in the description to reflect latest master. Specific data for the 3 trials before/after was 54.2, 54.4, 54.6 nsec/loop before and 35.4, 35.6, 34.4 nsec/loop after on M4 Max. On a Linux box I have 99.5, 102, 101 nsec/loop before and 73, 72.3, 74.5 nsec/loop after.

@rwgk anything I can do to make this easier to review?

I just need to find the time; I can only spend "spare" time. I'll try to get to it asap.

Did you see the email message I sent 10:01AM PST today (subject "pybind11 slack workspace")?

@swolchok
Copy link
Contributor Author

Because it's potentially relevant to this review: I started poking around nanobind just now and found that it uses alloca() to accomplish the same goal as this PR. I would not recommend this here; alloca() is, to quote its manpage on macOS, "slightly unsafe" because you can't in general rely on having enough space left on the stack for your allocation. To be fair, alloca() is probably fine for a sufficiently small number of arguments (after all, calling a function is slightly unsafe in general for the same reason), but we can get the same effect without the sharp edge by using a reasonably-sized fixed-size array as I've done here. (A way to make alloca() usage similarly safe is to use alloca() below some arbitrary upper bound and malloc() above it.)

@swolchok
Copy link
Contributor Author

(by the way, the CIBW failure looks unrelated? "Warning: An error occurred while preparing SDK package AOSP ATD Intel x86_64 Atom System Image: Error on ZipFile unknown archive.:")

@wjakob
Copy link
Member

wjakob commented Sep 17, 2025

I would not recommend this here; alloca() is, to quote its manpage on macOS, "slightly unsafe" because you can't in general rely on having enough space left on the stack for your allocation.

Please note that there are some checks in place for this. max_nargs is provided by the bindings and not the caller. To trigger an overflow here, you would need to make a C++ binding with hundreds of thousands of function arguments.

For keyword arguments, there is a check:

    /* The following lines allocate memory on the stack, which is very efficient
       but also potentially dangerous since it can be used to generate stack
       overflows. We refuse unrealistically large number of 'kwargs' (the
       'max_nargs' value is fine since it is specified by the bindings) */
    if (nkwargs_in > 1024) {
        PyErr_SetString(PyExc_TypeError,
                        "nanobind::detail::nb_func_vectorcall(): too many (> "
                        "1024) keyword arguments.");
        return nullptr;
    }

@swolchok
Copy link
Contributor Author

fair enough. the more pertinent reason not to use alloca() is that it would require both more invasive changes to struct function_call and also ongoing care to make sure struct function_call never outlives its defining stack frame lest the alloca()ed pointers dangle, whereas the argument_vector approach drops in cleanly. If folks want alloca() anyway we can try it.

swolchok added a commit to swolchok/pybind11 that referenced this pull request Sep 17, 2025
…nt for global types

nanobind has a similar two-level lookup strategy, added and explained
by
wjakob/nanobind@b515b1f

In this PR I've ported this approach to pybind11. To avoid an ABI
break, I've kept the fast maps to the `local_internals`. I think this
should be safe because any particular module should see its
`local_internals` reset at least as often as the global `internals`,
and misses in the fast "hint" map for global types fall back to the
global `internals`.

Performance seems to have improved. Using my patched fork of
pybind11_benchmark
(https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates,
specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I
run bench.py and observe the MyInt case. Each time, I do 3 runs and
just report all 3.

master, Mac: 75.9, 76.9, 75.3 nsec/loop
this PR, Mac: 73.8, 73.8, 73.6 nsec/loop
master, Linux box: 188, 187, 188 nsec/loop
this PR, Linux box: 164, 165, 164 nsec/loop

Note that the "real" percentage improvement is larger than implied by the
above because master does not yet include pybind#5824.
swolchok added a commit to swolchok/pybind11 that referenced this pull request Sep 17, 2025
…nt for global types

nanobind has a similar two-level lookup strategy, added and explained
by
wjakob/nanobind@b515b1f

In this PR I've ported this approach to pybind11. To avoid an ABI
break, I've kept the fast maps to the `local_internals`. I think this
should be safe because any particular module should see its
`local_internals` reset at least as often as the global `internals`,
and misses in the fast "hint" map for global types fall back to the
global `internals`.

Performance seems to have improved. Using my patched fork of
pybind11_benchmark
(https://github.com/swolchok/pybind11_benchmark/tree/benchmark-updates,
specifically commit hash b6613d12607104d547b1c10a8145d1b3e9937266), I
run bench.py and observe the MyInt case. Each time, I do 3 runs and
just report all 3.

master, Mac: 75.9, 76.9, 75.3 nsec/loop
this PR, Mac: 73.8, 73.8, 73.6 nsec/loop
master, Linux box: 188, 187, 188 nsec/loop
this PR, Linux box: 164, 165, 164 nsec/loop

Note that the "real" percentage improvement is larger than implied by the
above because master does not yet include pybind#5824.
vector.~heap_vector();
}
}
inline_array_or_vector(const inline_array_or_vector &) = delete;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment for easier reading?

E.g. // Disable copy ctor and assignment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally avoid comments that repeat the code, but if you specifically want one here then sure

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I'm totally with you ... just some minimal mercy for sleepy or rushed eyes scanning this code.

Comment on lines 64 to 65
inline_array array;
heap_vector vector;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about iarray and hvector (or iarr, hvec), to more clearly distinguish them from the std types?

// of bytes. See
// https://dev-discuss.pytorch.org/t/unionizing-for-profit-how-to-exploit-the-power-of-unions-in-c/444#the-memcpy-loophole-4
bool result = false;
std::memcpy(&result, reinterpret_cast<const char *>(this), sizeof(bool));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I got here I had to consult ChatGPT:

https://chatgpt.com/share/68cb906b-941c-8008-9ac2-acbda6c78ca8

ChatGPT is happy with this implementation.

TBH, I'm not the best person to review code manually optimized to this intensity.

If you added the static_asserts suggested by ChatGPT, it'd look safer to me.

@oremanj, is there a chance you could help out with a full review of this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the linked article (written by me) tries to explain this technique as well. I recommend reading from the top (and skipping sections if they are boring review to you; it starts slowly) if you have time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you have here is surely acceptable, but I can't help feeling uneasy about reaching so deeply into the bag of tricks. So I asked ChatGPT another question in the same conversation, same URL:

https://chatgpt.com/share/68cb906b-941c-8008-9ac2-acbda6c78ca8

It came back with this (copy-pasted here for easy reference):

template <typename ArrayT, std::size_t InlineSize, typename VectorT = ArrayT>
union inline_array_or_vector {
    struct tag_view {
        bool is_inline;
    };

    struct inline_array {
        bool is_inline;                // must be first
        std::uint32_t size = 0;
        std::array<ArrayT, InlineSize> arr;
    };

    struct heap_vector {
        bool is_inline;                // must be first
        std::vector<VectorT> vec;

        heap_vector() = default;
        heap_vector(std::size_t count, VectorT value) : vec(count, value) {}
    };

    // Union members
    tag_view     tag;    // “discriminator view”
    inline_array array;
    heap_vector  vector;

    inline_array_or_vector() : array{true, 0, {}} {}

    ~inline_array_or_vector() {
        if (!is_inline()) {
            vector.~heap_vector();
        }
    }

    bool is_inline() const {
        // Well-defined: all three arms share the same first member layout,
        // so the common-initial-sequence rule lets us read tag.is_inline.
        return tag.is_inline;
    }

    // ... move ctor/assign as in the PR, using array/vector arms ...
};

WDYT?

Copy link
Contributor Author

@swolchok swolchok Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChatGPT is wrong. You can't read from an inactive member of a union because it's undefined behavior, full stop. C may allow it, but C++ does not.

(EDIT: see correction below; ChatGPT is only wrong because heap_vector isn't guaranteed standard-layout)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++: "It is undefined behavior to read from the member of the union that wasn't most recently written. Many compilers implement, as a non-standard language extension, the ability to read inactive members of a union." -- https://en.cppreference.com/w/cpp/language/union.html

C: "If the member used to access the contents of a union is not the same as the member last used to store a value, the object representation of the value that was stored is reinterpreted as an object representation of the new type (this is known as type punning)" -- https://en.cppreference.com/w/c/language/union.html

Copy link
Contributor Author

@swolchok swolchok Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I stand partially corrected. I asked your ChatGPT conversation for a citation about the common initial sequence rule, and sure enough it does exist in C++ (which I did not realize) and it implies that this codeyour proposed tag_view change is fine, if inline_array and heap_vector are both standard-layout types. Sadly, heap_vector is not guaranteed to be standard-layout because std::vector is not guaranteed to be standard-layout, as we unfortunately ran into below. Good to know though!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! (I posted my other comment a minute ago before seeing this. I updated that comment with a strikethrough.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A non-standardese citation for the common initial sequence rule (which, again, I had previously missed) is just "If two union members are standard-layout types, it's well-defined to examine their common subsequence on any compiler." -- https://en.cppreference.com/w/cpp/language/union.html

…st. Add static_asserts for our untagged union implementation per request.
@swolchok
Copy link
Contributor Author

Hmm, MSVC thinks that heap_vector is not a standard-layout type. Time for some research.

Looks like /MDd as opposed to /MD is what causes the change: https://godbolt.org/z/axrWb5EM3

/MDd changes to a debugging version of the standard library: https://learn.microsoft.com/en-us/cpp/build/reference/md-mt-ld-use-run-time-library?view=msvc-170

and so it appears that the problem is because std::vector is not standard-layout in the debug library version: https://godbolt.org/z/ecP95nYje

I am going to remove the is_standard_layout assertions, because as far as I know there is no reason the contained types have to be standard layout. I am keeping the offsetof assertions because the offsets do in fact have to match. (They don't have to be 0, but they really should be so I'm happy to assert that they are.)

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked some more, looks good to me.

I didn't look at every single line of the production code, and only glanced through the tests.

@oremanj If you could help out with a second set of eyes in a similar fashion, I'd be comfortable merging this (with or without the tag_view idea).

@rwgk
Copy link
Collaborator

rwgk commented Sep 18, 2025

I am going to remove the is_standard_layout assertions, because as far as I know there is no reason the contained types have to be standard layout. I am keeping the offsetof assertions because the offsets do in fact have to match. (They don't have to be 0, but they really should be so I'm happy to assert that they are.)

Hm, mostly for my curiosity, could you please explain why it'd be OK if the offset is not 0?

        bool result = false;
        std::memcpy(&result, reinterpret_cast<const char *>(this), sizeof(bool));

It seems like memcpy will copy from offset 0, no?

If not, wouldn't it have to be like this (untested)?

        bool result = false;
        static_assert(offsetof(inline_array, is_inline) == offsetof(heap_vector, is_inline), "untagged union implementation relies on this");
        std::memcpy(&result, reinterpret_cast<const char *>(this) + offsetof(inline_array, is_inline), sizeof(bool));

@swolchok
Copy link
Contributor Author

Hm, mostly for my curiosity, could you please explain why it'd be OK if the offset is not 0?

I meant that it would be ok because we could make the change you outlined. I'm fairly sure it's guaranteed to actually be zero given the code we've written.

@swolchok
Copy link
Contributor Author

most recent CI run has a deadlock on only mingw64. Haven't seen this on previous runs and it seems odd that this change would cause a deadlock on exactly mingw, so I'd like to retry. I don't see a button to do that though; will let the one straggler job finish and then I guess just push another rev?

@rwgk
Copy link
Collaborator

rwgk commented Sep 18, 2025

I can help clicking the rerun button.

@swolchok
Copy link
Contributor Author

sure enough, the mingw64 deadlock disappeared.

@rwgk
Copy link
Collaborator

rwgk commented Sep 18, 2025

For completeness: I had to cancel the CI / 🐍 (macos-latest, 3.14t, -DCMAKE_CXX_STANDARD=20) job, it was hanging. This wasn't the first time. Certainly not related to this PR.

@rwgk
Copy link
Collaborator

rwgk commented Sep 19, 2025

Thanks @oremanj for the review!

@rwgk rwgk merged commit 30748f8 into pybind:master Sep 19, 2025
146 of 148 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Sep 19, 2025
@swolchok swolchok deleted the argument-small-vectors branch September 19, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs changelog Possibly needs a changelog entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants