Allocgate #10055


Merged · 10 commits · Dec 1, 2021

Conversation

@leecannon (Contributor)

This PR is a refactor of the Allocator interface to allow inlining more often and to avoid the issue regarding @fieldParentPtr and dirty tracking that LLVM suffers from, as explained here.

Closes #10052

@daurnimator (Contributor)

This increases the size of the Allocator argument, passed to a huge number of functions, to `3 * usize`.
I expect this to have significant negative effects.

@andrewrk (Member) commented Oct 29, 2021

I think @daurnimator has a good point. Have we tried this instead?

```zig
const Allocator = struct {
    ptr: *c_void,
    vtable: *const VTable,

    const VTable = struct {
        alloc: fn (ptr: *c_void, len: usize, ptr_align: u29, len_align: u29, ret_addr: usize) Error![]u8,
        resize: fn (ptr: *c_void, buf: []u8, buf_align: u29, new_len: usize, len_align: u29, ret_addr: usize) Error!usize,
    };
};
```

This might also address the issue that @ominitay found where LLVM did not do the devirtualization, since the vtable pointer is marked constant and could point to global static constant memory. It also lets us drop the Fn suffix.
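For illustration, an implementation could wire up this proposed interface roughly as in the following sketch. `MyFba`, its fields, and the bump-allocation stub are all hypothetical, using the same pre-0.10 syntax as the snippet above:

```zig
// Hypothetical sketch of an implementation of the proposed { ptr, vtable }
// interface; names are illustrative, not part of any real API.
const MyFba = struct {
    buffer: []u8,
    end_index: usize = 0,

    // One statically-known vtable per implementation type, so every
    // Allocator handed out points at the same constant memory.
    const vtable = Allocator.VTable{
        .alloc = alloc,
        .resize = resize,
    };

    pub fn allocator(self: *MyFba) Allocator {
        return .{ .ptr = @ptrCast(*c_void, self), .vtable = &vtable };
    }

    fn alloc(ptr: *c_void, len: usize, ptr_align: u29, len_align: u29, ret_addr: usize) Allocator.Error![]u8 {
        const self = @ptrCast(*MyFba, @alignCast(@alignOf(MyFba), ptr));
        _ = ptr_align;
        _ = len_align;
        _ = ret_addr;
        // Trivial bump allocation, ignoring alignment for brevity.
        if (self.end_index + len > self.buffer.len) return error.OutOfMemory;
        const result = self.buffer[self.end_index .. self.end_index + len];
        self.end_index += len;
        return result;
    }

    fn resize(ptr: *c_void, buf: []u8, buf_align: u29, new_len: usize, len_align: u29, ret_addr: usize) Allocator.Error!usize {
        // Sketch only: refuse all resizes.
        _ = ptr;
        _ = buf;
        _ = buf_align;
        _ = new_len;
        _ = len_align;
        _ = ret_addr;
        return error.OutOfMemory;
    }
};
```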

@leecannon (Contributor, Author) commented Oct 29, 2021

I have done some extremely simple comparisons using code where the overhead of the virtual call is most noticeable.

FixedBufferAllocator

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./trivial_fba_old` | 10.8 ± 1.9 | 5.7 | 12.5 | 2.59 ± 0.57 |
| `./trivial_fba_new` | 4.6 ± 1.1 | 2.0 | 6.4 | 1.10 ± 0.31 |
| `./trivial_fba_new_vtable` | 4.2 ± 0.6 | 1.2 | 7.4 | 1.00 |
FBA code:

```zig
const std = @import("std");

var backing_buffer: [std.mem.page_size * 200]u8 = undefined;
var fba = std.heap.FixedBufferAllocator.init(&backing_buffer);

pub fn main() !void {
    const allocator = fba.allocator(); // or &fba.allocator;

    var to_deinit: [backing_buffer.len]*u8 = undefined;

    var i: usize = 0;
    while (true) : (i += 1) {
        const ptr = allocator.create(u8) catch break;
        to_deinit[i] = ptr;
    }

    for (to_deinit) |ptr| {
        allocator.destroy(ptr);
    }
}
```

GeneralPurposeAllocator

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./trivial_gpa_old` | 46.7 ± 7.1 | 28.1 | 52.3 | 1.16 ± 0.33 |
| `./trivial_gpa_new` | 43.7 ± 8.0 | 24.0 | 51.5 | 1.08 ± 0.33 |
| `./trivial_gpa_new_vtable` | 40.4 ± 9.9 | 25.7 | 49.8 | 1.00 |
GPA code:

```zig
const std = @import("std");

const amount = std.mem.page_size * 200;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator(); // or &gpa.allocator;
    defer _ = gpa.deinit();

    var to_deinit: [amount]*u8 = undefined;

    var i: usize = 0;
    while (i < amount) : (i += 1) {
        const ptr = allocator.create(u8) catch break;
        to_deinit[i] = ptr;
    }

    for (to_deinit) |ptr| {
        allocator.destroy(ptr);
    }
}
```

More extensive testing with non-trivial codebases would be required to see what the real-world impact of these changes is.

@andrewrk (Member)

Do I understand correctly that the `vtable: *const VTable` strategy seems to perform the best of all?

@leecannon (Contributor, Author)

@andrewrk yes, seems so

@ominitay (Contributor)

I'm finding in my benchmarks that the VTable consistently slows things down compared to without. example1 is the same as the GPA example which @leecannon sent.
(benchmark results image)

@ominitay (Contributor)

And regarding the FBA example, the same pattern is shown
(benchmark results image)

@ominitay (Contributor)

In fact, the generated assembly for `-new` and `-new-vtable` is byte-for-byte identical. This does not apply to the GPA example, though.

@jayschwa (Contributor)

> This increases the size of the Allocator argument passed to a huge number of functions to be 3*usize. I expect this to have significant negative effects.

Theoretically, won't Zig transform these to pass-by-reference if it's more optimal? If yes, is the vtable method simply a practical "workaround" for Zig (or LLVM) not recognizing the optimization potential in this particular situation?

@leecannon (Contributor, Author) commented Oct 29, 2021

I have been looking closer at this with perf, valgrind, etc., and they have shown either no change or worse results (cache misses, stalled cycles, etc.) for the vtable version.

Also, as @ominitay has shown, the vtable and non-vtable versions either differ by less than the margin of error, or the vtable is substantially worse.

I am going to revert the vtable change; if anyone can provide an example of where the 3-usize size of Allocator is a detriment, it can be reconsidered.

@Jarred-Sumner (Contributor) commented Oct 31, 2021

What if the allocator was a comptime argument? Is it common to dynamically change which allocator is in use at runtime? I don't think I've seen code that does that. If the allocator became a comptime argument, would that mean the cost of passing the Allocator around becomes zero? I also wonder what the impact on Zig compilation time would look like.

@ikskuh (Contributor) commented Oct 31, 2021

> What if allocator was a comptime argument? Is it common to dynamically change which allocator is in use at runtime? I don't think I've seen code that does that. If allocator became a comptime argument, would that mean the cost of passing the Allocator around becomes zero? I also wonder what the impact to zig compilation time would look like

This would not work, as comptime arguments must be constant at comptime, so something like FixedBufferAllocator, which carries runtime state, could not be passed that way. If you instead pass the allocator type, as in `fn allocates(comptime Allocator: type, allocator: Allocator) !void`, you make everything generic, and compile times will explode. Using the same module with two different allocators in your project would also duplicate the code and all functions/variables in that module.

Example: everything that takes an `ArrayList(u8)` argument would suddenly be generic over the array list, as it would become `ArrayList(Allocator, u8)`. I assume this won't scale at all and will bloat executables, similar to std.fmt or std.Writer.
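The explosion described above can be sketched as follows; these signatures are hypothetical, and `Allocator` here is a type parameter rather than the std interface:

```zig
// Hypothetical sketch: once the allocator is a comptime type parameter,
// every container parameterized on it becomes generic over it too.
fn ArrayList(comptime Allocator: type, comptime T: type) type {
    return struct {
        allocator: Allocator,
        items: []T = &[_]T{},
    };
}

// A function that used to accept a plain ArrayList(u8) must now itself be
// generic over the allocator, duplicating its code per allocator type:
fn takesBytes(comptime Allocator: type, list: ArrayList(Allocator, u8)) void {
    _ = list;
}
```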

@Jarred-Sumner (Contributor) commented Oct 31, 2021 via email

@ifreund (Member) commented Nov 1, 2021

A while ago on IRC @andrewrk mentioned this as one of several issues with the allocator interface design:

```
00:55 <andrewrk> I want to do another design pass on the Allocator interface
00:55 <andrewrk> a handful of design issues has piled up
00:55 <andrewrk> * the @fieldParentPtr way of abstracting is fundamentally more difficult to optimize than a vtable strategy
00:56 <andrewrk> * the mechanism to get more bytes than you asked for is dubious and might be at worst doing more harm than good and at best a bunch of complication for no benefit
00:57 <andrewrk> * I want allocator implementations to have access to a comptime alignment. maybe even the type??
00:57 <andrewrk> * @memset to undefined should happen in the Allocator implementations rather than the interface to avoid redundant memsets in safe modes
00:59 <andrewrk> * the API is a bit of a mess
```

In particular, I wonder how these changes interact with the desire to make a comptime alignment or type available to the allocator implementations. It would be best to do one massively breaking change instead of two if at all possible.

@marler8997 (Contributor) commented Nov 3, 2021

> the mechanism to get more bytes than you asked for is dubious and might be at worst doing more harm than good and at best a bunch of complication for no benefit

I'm curious about @andrewrk's reasoning on this. If it really has no benefit then we can remove it, but I believe there are use cases for it (e.g. ArrayList); maybe Andrew thinks those use cases can use another mechanism and be just as good?

For most cases, realloc is fine. The only case I can think of that could benefit from more than realloc is when you just want to know how much space is available, because you're not sure how much you'll need yet but would benefit from knowing the full amount available to you. Without an interface to get the available length, you might realloc more than is available and end up not actually needing it, so you would move memory when you didn't need to.

Example:

You're putting data into an ArrayList. It has allocated 200 bytes. The allocator reserved 256 bytes for it, but ArrayList doesn't know that (because we removed the mechanism to query this). The ArrayList is full but doesn't know how much it will need, so it just asks for double its size, 400 bytes. The allocator can't expand its current slot, so it has to reallocate a new larger slot and move all the memory over. In the end, the ArrayList only actually needed 30 more bytes, so this reallocation/memcpy was wasted. Is there a way to avoid this scenario if we remove the mechanism to know how much space is actually reserved by the allocator?
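The scenario above could be sketched like this; `queryFullCapacity` is a hypothetical operation used only for illustration, and the list fields are simplified:

```zig
// Hypothetical sketch: `queryFullCapacity` is an imagined allocator
// operation, not part of the current Allocator interface.
fn growForAppend(list: anytype, allocator: Allocator, needed: usize) !void {
    // e.g. list.capacity == 200, the allocator reserved 256, needed == 30
    const reserved = allocator.queryFullCapacity(list.items.ptr[0..list.capacity]);
    if (list.items.len + needed <= reserved) {
        // The slot already has room: grow in place, no realloc, no memcpy.
        list.capacity = reserved;
    } else {
        // Otherwise double the capacity as before, possibly moving memory.
        try list.ensureTotalCapacity(list.capacity * 2);
    }
}
```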

@marler8997 (Contributor) commented Nov 3, 2021

Here's an idea that should remove the need for len_align, remove the need for the @fieldParentPtr mechanism, and provide access to comptime alignment:

```zig
///
/// A new allocator API. Instead of an allocator providing 2 function pointers, it
/// provides a single function pointer that takes a tagged union representing an
/// "AllocatorRequest". This allows us to enhance/scale the allocator interface
/// with more operations without having to add additional function pointers.
///
/// Using this new ability to add more operations, we can remove the `len_align`
/// parameter to the allocation functions and add a `query_full_capacity` request
/// instead.
///
/// We can also use this new ability to easily add operations to support comptime
/// aligned allocations.
///
/// Note: I've added a context pointer to the AllocFn so we don't need to depend
/// on `@fieldParentPtr`.
pub const AllocFn = fn (self: *c_void, request: *AllocatorRequest) void;
pub const AllocatorRequest = union(enum) {
    // (`alignment` rather than `align`, which is a reserved keyword)
    alloc: struct { len: usize, alignment: u29, ret_addr: usize },

    // means we can allocate with comptime alignments up to 16
    alloc_align2: struct { len: usize, ret_addr: usize },
    alloc_align4: struct { len: usize, ret_addr: usize },
    alloc_align8: struct { len: usize, ret_addr: usize },
    alloc_align16: struct { len: usize, ret_addr: usize },

    resize: Resize,

    /// returns the full capacity available to an allocation
    query_full_capacity: struct { buf: []u8, alignment: u29 },

    pub const Resize = struct {
        buf: []u8,
        alignment: u29,
        new_len: usize,
        ret_addr: usize,
    };
};
```

@SpexGuy (Contributor) commented Nov 3, 2021

A union(enum) with runtime-known fields has a runtime-known tag, so I don't see how this makes anything comptime-known, except that you could make specialized code paths; but you can do that anyway with the current pattern, and it's not something we should force allocators to do. Running everything through one function seems potentially dubious, since in the case where the function is not inlined it's a lot of unnecessary branching. This pattern means that in order to query the size of an allocation, you need to construct a 48-byte stack value in memory and then pass its address into a virtual function. While this does solve the API problems, so does a vtable, and I think the vtable approach is probably better/faster/easier to optimize.

@marler8997 (Contributor)

> union (enum) with runtime known fields has a runtime known tag, so I don't see how this makes anything comptime-known. Except that you could make specialized code paths, but you can do that anyway with the current pattern, and it's not something we should force allocators to do.

Oh yeah, now that I think about it you might be right here. I think I need some clarification from Andrew about what he means by comptime-known alignment. My suggestion here was based on an old suggestion from him where he added new allocation function pointers for a limited set of alignments. That said, using function pointers doesn't have the same downside as union(enum) (like you say), since function pointers don't need to branch like a tagged union would.

> Running everything through one function seems potentially dubious, since in the case where the function is not inlined it's a lot of unnecessary branching.

I assume by "dubious" you mean "slower than an alternative". I can't say here, but I think we should measure if we decide to add more "operations", either via union(enum) or vtable.

> This pattern means that in order to query the size of an allocation, you need to construct a 48 byte stack value in memory and then pass the address of that into a virtual function.

We need to pass these 48 bytes of memory somehow; we can't get around that. I can't say which mechanism is faster, though. I'll grant that with a tagged union, some cases will hog more stack memory than they need, but in others I'm guessing they wouldn't? Does a tagged union with a small activated field only take up as much stack memory as that field needs? I'd be interested to know.
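For what it's worth, a tagged union is always sized for its largest field plus the tag (and any padding), regardless of which field is active, which can be checked with a small sketch (field names are illustrative):

```zig
const std = @import("std");

const Request = union(enum) {
    small: u8,
    big: [40]u8,
};

pub fn main() void {
    // @sizeOf reports the full union size: it covers the largest field
    // plus the tag, no matter which field is currently active.
    std.debug.print("@sizeOf(Request) = {d}\n", .{@sizeOf(Request)});
}
```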

> While this does solve the api problems, so does a vtable, and I think the vtable approach is probably better/faster/easier to optimize.

You could be right here and I think you have more insight on this than myself. I think the approach of breaking up our 2 operations into smaller ones has merit, whether it's better to use a vtable or a tagged union, I'm not sure.

@andrewrk (Member)

@leecannon I'd like to merge this without a6246c8. I think we should change one thing at a time and now we have the luxury of looking at how the graphs on https://ziglang.org/perf/ will change with each thing.

Can we do the interface change alone?

@leecannon (Contributor, Author)

Okay, I have pulled the shrink change out and will make a new PR for it once this is merged.

And just to note: the shrink change is a correctness fix, not a performance improvement, which is even more reason to split it out.

@andrewrk (Member) left a comment

Great work. Thank you for taking on such a monumental task.

Ready to merge as soon as we have the green light.

@andrewrk andrewrk merged commit 7355a20 into ziglang:master Dec 1, 2021
@leecannon leecannon deleted the allocator_refactor branch December 1, 2021 02:55
Snektron added a commit to Snektron/vulkan-zig that referenced this pull request Dec 1, 2021
@andrewrk (Member) commented Dec 1, 2021

Looking at https://ziglang.org/perf/, I see a major regression in maxrss on a few benchmarks after merging this. For example, the arena-allocator benchmark went from using 4 MiB to 11 GiB.

@SpexGuy (Contributor) commented Dec 1, 2021

Holy crap! There must be a broken free somewhere; the increased size of Allocator could account for at most a 2x increase in memory use, and only if every piece of memory in the program were an Allocator.

ehaas added a commit to ehaas/arocc that referenced this pull request Dec 4, 2021
Allocators are now passed by value instead of reference.
See ziglang/zig#10055

Closes Vexu#163
Vexu pushed a commit to Vexu/arocc that referenced this pull request Dec 6, 2021
Allocators are now passed by value instead of reference.
See ziglang/zig#10055

Closes #163
Snektron added a commit to Snektron/vulkan-zig that referenced this pull request Dec 8, 2021
Hejsil added a commit to Hejsil/mecha that referenced this pull request Dec 16, 2021
* Update to new Allocator API

See ziglang/zig#10055

Co-authored-by: Komari Spaghetti <[email protected]>
Co-authored-by: Nameless <[email protected]>
10 participants