Allocgate #10055
Conversation
This increases the size of the `Allocator` …
I think @daurnimator has a good point. Have we tried this instead?

```zig
const Allocator = struct {
    ptr: *c_void,
    vtable: *const VTable,

    const VTable = struct {
        alloc: fn (ptr: *c_void, len: usize, ptr_align: u29, len_align: u29, ret_addr: usize) Error![]u8,
        resize: fn (ptr: *c_void, buf: []u8, buf_align: u29, new_len: usize, len_align: u29, ret_addr: usize) Error!usize,
    };
};
```

This might also address the issue that @ominitay found where LLVM did not do the devirtualization, since the vtable pointer is marked constant and could point to global static constant memory. It also lets us drop the `@fieldParentPtr` trick.
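If I read the proposal right, each implementation would keep one `VTable` in global constant memory and hand out `{ptr, vtable}` pairs by value. Here is a minimal sketch under that assumption; the `CountingAllocator` type and its stub bodies are invented for illustration, and it assumes the `Allocator`, `VTable`, and `Error` declarations above are accessible:

```zig
const CountingAllocator = struct {
    count: usize = 0,

    // A single static vtable shared by every instance: `&vtable` is a
    // comptime-known pointer into constant memory, which is what should
    // let LLVM devirtualize calls through it.
    const vtable = Allocator.VTable{
        .alloc = allocImpl,
        .resize = resizeImpl,
    };

    pub fn allocator(self: *CountingAllocator) Allocator {
        return .{ .ptr = self, .vtable = &vtable };
    }

    fn allocImpl(ptr: *c_void, len: usize, ptr_align: u29, len_align: u29, ret_addr: usize) Error![]u8 {
        // The context pointer replaces the old @fieldParentPtr dance.
        const self = @ptrCast(*CountingAllocator, @alignCast(@alignOf(CountingAllocator), ptr));
        self.count += 1;
        _ = len;
        _ = ptr_align;
        _ = len_align;
        _ = ret_addr;
        return error.OutOfMemory; // stub: a real implementation allocates here
    }

    fn resizeImpl(ptr: *c_void, buf: []u8, buf_align: u29, new_len: usize, len_align: u29, ret_addr: usize) Error!usize {
        _ = ptr;
        _ = buf;
        _ = buf_align;
        _ = new_len;
        _ = len_align;
        _ = ret_addr;
        return error.OutOfMemory; // stub
    }
};
```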
I have done some extremely simple comparisons using code where the overhead of the virtual call is most noticeable.

FixedBufferAllocator:

```zig
const std = @import("std");

var backing_buffer: [std.mem.page_size * 200]u8 = undefined;
var fba = std.heap.FixedBufferAllocator.init(&backing_buffer);

pub fn main() !void {
    const allocator = fba.allocator(); // or &fba.allocator;
    var to_deinit: [backing_buffer.len]*u8 = undefined;
    var i: usize = 0;
    while (true) : (i += 1) {
        const ptr = allocator.create(u8) catch break;
        to_deinit[i] = ptr;
    }
    for (to_deinit) |ptr| {
        allocator.destroy(ptr);
    }
}
```

GeneralPurposeAllocator:
```zig
const std = @import("std");

const amount = std.mem.page_size * 200;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator(); // or &gpa.allocator;
    defer _ = gpa.deinit();

    var to_deinit: [amount]*u8 = undefined;
    var i: usize = 0;
    while (i < amount) : (i += 1) {
        const ptr = allocator.create(u8) catch break;
        to_deinit[i] = ptr;
    }
    for (to_deinit) |ptr| {
        allocator.destroy(ptr);
    }
}
```
More extensive testing with non-trivial codebases would be required to see what the real-world impact of these changes is.
Do I understand correctly that the strategy with …
@andrewrk yes, seems so.
I'm finding in my benchmarks that the …
In fact, the generated assembly for `-new` and `-new-vtable` is byte-for-byte identical. This does not apply to the GPA example, though.
Theoretically, won't Zig transform these to pass-by-reference if that's more optimal? If yes, is the vtable method simply a practical "workaround" for Zig (or LLVM) not recognizing the optimization potential in this particular situation?
I have been looking closer at this with perf, valgrind, etc., and they have shown either no change or worse results (cache misses, stalled cycles, etc.) for the vtable version. Also, as @ominitay has shown, the vtable and non-vtable versions either differ by less than the margin of error or the vtable version is substantially worse. I am going to revert the vtable change; if anyone can provide an example where the 3-usize size of `Allocator` is a detriment, it can be reconsidered.
What if `allocator` was a comptime argument? Is it common to dynamically change which allocator is in use at runtime? I don't think I've seen code that does that. If `allocator` became a comptime argument, would that mean the cost of passing the `Allocator` around becomes zero? I also wonder what the impact on Zig compilation time would look like.
This would not work, as comptime arguments must be constant at comptime; something like `FixedBufferAllocator` would not work. If you only want to pass `fn allocates(comptime Allocator: type, allocator: Allocator) !void`, you'll make everything generic, and compile times will explode. You'll also be unable to use the same module with two different allocators in your project, as you will duplicate the code and all functions/variables in that module.

Example: everything that takes an `ArrayList(u8)` argument will suddenly be generic over the array list, as it would be `ArrayList(Allocator, u8)`. I assume this won't scale at all and will bloat the executables, similar to `std.fmt` or `std.Writer`.
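To make the propagation concrete, here is a rough sketch of what containers and their consumers would look like if the allocator type were a comptime parameter (all names here are illustrative, not a real proposal):

```zig
const std = @import("std");

// Every container must now be parameterized over the allocator type...
fn ArrayList(comptime AllocatorType: type, comptime T: type) type {
    return struct {
        items: []T = &[_]T{},
        allocator: AllocatorType,
    };
}

// ...so there is no single `ArrayList(u8)` type anymore, and any function
// that takes one is forced to become generic too (via `anytype` or a
// comptime type parameter) -- the "function coloring" mentioned above.
fn totalLen(list: anytype) usize {
    return list.items.len;
}

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    // Each allocator type produces a distinct, incompatible list type;
    // instantiating the same module with a different allocator would
    // duplicate all of this code.
    const GpaList = ArrayList(*std.heap.GeneralPurposeAllocator(.{}), u8);
    var list = GpaList{ .allocator = &gpa };
    _ = totalLen(list);
}
```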
Ah, yeah that makes sense. It becomes a kind of function coloring almost.
A while ago on IRC @andrewrk mentioned this as one of several issues with the allocator interface design.
In particular, I wonder how these changes interact with the desire to make a comptime alignment or type available to the allocator implementations. It would be best to do one massively breaking change instead of two if at all possible.
I'm curious about @andrewrk's reasoning on this. If it really has no benefit then we can remove it, but I believe there are use cases for it (e.g. `ArrayList`); maybe Andrew thinks those use cases can use another mechanism and be just as good?

For most cases, realloc is fine. The only case I can think of that could benefit from having more than realloc is when you just want to know how much space you have available, because you're not sure how much you'll need yet but would benefit from knowing the full amount available to you. Without an interface to get the available length, you might realloc to more than is available in place and end up not actually needing it, so you would move memory when you didn't need to.

Example: you're putting data into an `ArrayList`. It has allocated 200 bytes. The allocator reserved 256 bytes for it, but the `ArrayList` doesn't know that (because we removed the mechanism to query this). The `ArrayList` is full but doesn't know how much more it will need, so it just asks for double its size, 400 bytes. The allocator can't expand its current slot, so it has to reallocate a new larger slot and move all the memory over. In the end, the `ArrayList` only actually needed 30 more bytes, so this reallocation/memcpy was wasted.

Is there a way to avoid this scenario if we remove the mechanism to know how much space is actually reserved by the allocator?
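As a thought experiment, the scenario maps to something like the sketch below; `queryFullCapacity` is a made-up stand-in for the removed mechanism, not a real std API:

```zig
const std = @import("std");

// Hypothetical: pretend the allocator reserved a 256-byte slot for us
// and we have a way to ask about it.
fn queryFullCapacity(buf: []u8) usize {
    _ = buf;
    return 256;
}

pub fn main() void {
    var slot: [256]u8 = undefined;
    var items: []u8 = slot[0..200]; // the ArrayList's current 200-byte view
    const needed: usize = 30;

    const full = queryFullCapacity(items);
    if (full - items.len >= needed) {
        // Claim the slack in place: no reallocation, no memcpy.
        items = slot[0..full];
        std.debug.print("grew in place to {d} bytes\n", .{items.len});
    } else {
        // Without the query, the list would blindly double to 400 bytes
        // and possibly force a move for 30 bytes of actual need.
        std.debug.print("would have to realloc and move\n", .{});
    }
}
```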
Here's an idea that should remove the need for `len_align`:

```zig
/// A new allocator API. Instead of an allocator providing 2 function pointers, they
/// provide a single function pointer that takes a tagged union representing an "AllocatorRequest".
/// This allows us to enhance/scale the allocator interface with more operations without having
/// to add additional function pointers.
///
/// Using this new ability to add more operations, we can remove the `len_align` parameter to the
/// allocation functions and add a `query_full_capacity` request instead.
///
/// We can also use this new ability to easily add operations to support comptime aligned allocations.
///
/// Note: I've added a context pointer to the AllocFn so we don't need to depend on `@fieldParentPtr`.
pub const AllocFn = fn (self: *c_void, request: *AllocatorRequest) void;

pub const AllocatorRequest = union(enum) {
    // (`alignment` rather than `align`, since `align` is a keyword)
    alloc: struct { len: usize, alignment: u29, ret_addr: usize },
    // means we can allocate with comptime alignments up to 16
    alloc_align2: struct { len: usize, ret_addr: usize },
    alloc_align4: struct { len: usize, ret_addr: usize },
    alloc_align8: struct { len: usize, ret_addr: usize },
    alloc_align16: struct { len: usize, ret_addr: usize },
    resize: Resize,
    /// returns the full capacity available to an allocation
    query_full_capacity: struct { buf: []u8, alignment: u29 },

    pub const Resize = struct {
        buf: []u8,
        alignment: u29,
        new_len: usize,
        ret_addr: usize,
    };
};
```
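For context, an allocator implementing this would presumably switch on the union tag. A bare-bones sketch of the dispatch (it assumes the `AllocatorRequest` definition above, and leaves open how results travel back to the caller, since the proposal's `AllocFn` returns `void`):

```zig
fn handleRequest(self: *c_void, request: *AllocatorRequest) void {
    _ = self;
    switch (request.*) {
        .alloc => |r| {
            // runtime-known alignment path
            _ = r;
        },
        .alloc_align2, .alloc_align4, .alloc_align8 => {
            // each of these tags implies a comptime-known alignment
        },
        .alloc_align16 => |r| {
            // alignment is known to be 16 here, so the implementation
            // can take a specialized fast path
            _ = r;
        },
        .resize => |r| {
            _ = r;
        },
        .query_full_capacity => |r| {
            // report how much capacity `r.buf` really has
            _ = r;
        },
    }
}
```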
Oh yeah, now that I think about it, you might be right here. I think I need some clarification from Andrew about what he means by comptime-known alignment. My suggestion here was based on an old suggestion from him where he added new allocation function pointers for a limited set of alignments. That being said, using function pointers doesn't have the same downside as …

I assume by "dubious" you mean "slower than an alternative". I can't say here, but I think we should measure if we decide to add more "operations", whether via `union(enum)` or a vtable.

We need to pass these 48 bytes of memory somehow; we can't get around that. I can't say which mechanism is faster, though. I will give you that with a tagged union there will be some cases that hog more stack memory than they need to, but in others I'm guessing they wouldn't? Does a tagged union with a "small activated field" only take up as much stack memory as that field needs? Would be interested to know.

You could be right here, and I think you have more insight on this than I do. I think the approach of breaking up our 2 operations into smaller ones has merit; whether it's better to use a vtable or a tagged union, I'm not sure.
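For what it's worth: in Zig, a `union(enum)` is sized for its largest field plus the tag, regardless of which field is active, so a small active field does not shrink the stack footprint. A quick check:

```zig
const std = @import("std");

const Request = union(enum) {
    small: u8,
    large: [48]u8,
};

pub fn main() void {
    // Sized by the largest field plus the tag (and padding), even when
    // the active field is `small`.
    std.debug.print("@sizeOf(Request) = {d}\n", .{@sizeOf(Request)});
    const r = Request{ .small = 1 };
    _ = r; // still occupies @sizeOf(Request) bytes on the stack
}
```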
@leecannon I'd like to merge this without a6246c8. I think we should change one thing at a time, and now we have the luxury of looking at how the graphs on https://ziglang.org/perf/ will change with each thing. Can we do the interface change alone?
Okay, I have pulled the shrink change out and will make a new PR for it once this is merged. And just to note, the shrink change is a correctness issue, not a performance improvement, which is even more reason to split it out.
Great work. Thank you for taking on such a monumental task.
Ready to merge as soon as we have the green light.
Looking at https://ziglang.org/perf/, I see a major regression in maxrss on a few benchmarks after merging this. For example, the arena-allocator benchmark went from using 4 MiB to 11 GiB.
Holy crap! There must be a broken free somewhere; the increased size of `Allocator` can account for at most a 2x increase in memory use, and that only if every piece of memory in the program is an `Allocator`.
Allocators are now passed by value instead of reference. See ziglang/zig#10055 Closes Vexu#163
* Update to new Allocator API See ziglang/zig#10055 Co-authored-by: Komari Spaghetti <[email protected]> Co-authored-by: Nameless <[email protected]>
This PR is a refactor of the `Allocator` interface to allow inlining more often and to prevent the issue regarding `@fieldParentPtr` and dirty tracking that LLVM suffers from, as explained here. Closes #10052
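The downstream commits above show the headline change for users: allocators are now passed by value. A minimal sketch of the new style (the `makeGreeting` helper is invented for illustration):

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

// Before: fn makeGreeting(allocator: *Allocator, ...) -- taken by pointer.
// After Allocgate, the interface is a small value passed by value:
fn makeGreeting(allocator: Allocator, name: []const u8) ![]u8 {
    return std.fmt.allocPrint(allocator, "hello, {s}", .{name});
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    // The `allocator()` method returns the interface by value.
    const allocator = gpa.allocator();
    const msg = try makeGreeting(allocator, "zig");
    defer allocator.free(msg);
    std.debug.print("{s}\n", .{msg});
}
```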