stage2 array access performance (boundchecks related?) #12215
Most of the time is spent in memcpy. Here is a portion of the relevant LLVM IR.

stage 1:

```llvm
BoundsCheckOk4:                                   ; preds = %BoundsCheckOk
  %32 = getelementptr inbounds [10000 x i64], [10000 x i64]* %28, i64 0, i64 %30, !dbg !3086
  %33 = load i64, i64* %32, align 8, !dbg !3086
  %34 = load i32, i32* %a, align 4, !dbg !3087
  %35 = zext i32 %34 to i64, !dbg !3087
  %36 = load [10 x [10000 x i64]]*, [10 x [10000 x i64]]** %array, align 8, !dbg !3088
  %37 = icmp ult i64 %35, 10, !dbg !3088
  br i1 %37, label %BoundsCheckOk6, label %BoundsCheckFail5, !dbg !3088
BoundsCheckFail5:                                 ; preds = %BoundsCheckOk4
  call fastcc void @std.builtin.default_panic(%"[]u8"* @86, %std.builtin.StackTrace* null), !dbg !3088
  unreachable, !dbg !3088
BoundsCheckOk6:                                   ; preds = %BoundsCheckOk4
  %38 = getelementptr inbounds [10 x [10000 x i64]], [10 x [10000 x i64]]* %36, i64 0, i64 %35, !dbg !3088
  %39 = load i32, i32* %b, align 4, !dbg !3089
  %40 = zext i32 %39 to i64, !dbg !3089
  %41 = icmp ult i64 %40, 10000, !dbg !3090
  br i1 %41, label %BoundsCheckOk8, label %BoundsCheckFail7, !dbg !3090
BoundsCheckFail7:                                 ; preds = %BoundsCheckOk6
  call fastcc void @std.builtin.default_panic(%"[]u8"* @86, %std.builtin.StackTrace* null), !dbg !3090
  unreachable, !dbg !3090
BoundsCheckOk8:                                   ; preds = %BoundsCheckOk6
  %42 = getelementptr inbounds [10000 x i64], [10000 x i64]* %38, i64 0, i64 %40, !dbg !3090
  %43 = load i64, i64* %42, align 8, !dbg !3090
  %44 = load i32, i32* %i, align 4, !dbg !3091
  %45 = zext i32 %44 to i64, !dbg !3091
  %46 = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %43, i64 %45), !dbg !3092
  %47 = extractvalue { i64, i1 } %46, 0, !dbg !3092
  %48 = extractvalue { i64, i1 } %46, 1, !dbg !3092
  br i1 %48, label %OverflowFail, label %OverflowOk, !dbg !3092
```

stage 2:

```llvm
Block5:                                           ; preds = %Then3
  %31 = getelementptr inbounds [10000 x i64], [10000 x i64]* %27, i32 0, i64 %29, !dbg !4998
  %32 = load i64, i64* %31, align 8, !dbg !4998
  %33 = load i32, i32* %8, align 4, !dbg !4998
  %34 = zext i32 %33 to i64, !dbg !4998
  %35 = icmp ult i64 %34, 10, !dbg !4998
  br i1 %35, label %Then6, label %Else7, !dbg !4998
Then6:                                            ; preds = %Block5
  br label %Block8, !dbg !4998
Else7:                                            ; preds = %Block5
  call fastcc void @builtin.panicOutOfBounds(i64 %34, i64 10), !dbg !4998
  unreachable, !dbg !4998
Block8:                                           ; preds = %Then6
  %36 = getelementptr inbounds [10 x [10000 x i64]], [10 x [10000 x i64]]* %17, i32 0, i64 %34, !dbg !4998
  %37 = bitcast [10000 x i64]* %5 to i8*, !dbg !4998
  %38 = bitcast [10000 x i64]* %36 to i8*, !dbg !4998
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %37, i8* align 8 %38, i64 80000, i1 false), !dbg !4998
  %39 = load i32, i32* %7, align 4, !dbg !4998
  %40 = zext i32 %39 to i64, !dbg !4998
  %41 = icmp ult i64 %40, 10000, !dbg !4998
  br i1 %41, label %Then9, label %Else10, !dbg !4998
Then9:                                            ; preds = %Block8
  br label %Block11, !dbg !4998
Else10:                                           ; preds = %Block8
  call fastcc void @builtin.panicOutOfBounds(i64 %40, i64 10000), !dbg !4998
  unreachable, !dbg !4998
Block11:                                          ; preds = %Then9
  %42 = getelementptr inbounds [10000 x i64], [10000 x i64]* %5, i32 0, i64 %40, !dbg !4998
  %43 = load i64, i64* %42, align 8, !dbg !4998
  %44 = load i32, i32* %6, align 4, !dbg !4998
  %45 = zext i32 %44 to i64, !dbg !4998
  %46 = call fastcc { i64, i1 } @llvm.umul.with.overflow.i64(i64 %43, i64 %45), !dbg !4998
  %47 = extractvalue { i64, i1 } %46, 0, !dbg !4998
  %48 = extractvalue { i64, i1 } %46, 1, !dbg !4998
  %49 = getelementptr inbounds %2, %2* %4, i32 0, i32 0, !dbg !4998
  store i64 %47, i64* %49, align 8, !dbg !4998
  %50 = getelementptr inbounds %2, %2* %4, i32 0, i32 1, !dbg !4998
  store i1 %48, i1* %50, align 1, !dbg !4998
  %51 = getelementptr inbounds %2, %2* %4, i32 0, i32 1, !dbg !4998
  %52 = load i1, i1* %51, align 1, !dbg !4998
  %53 = icmp eq i1 %52, false, !dbg !4998
  br i1 %53, label %Then12, label %Else13, !dbg !4998
```
To everyone who is looking for a temporary workaround: storing a pointer to the array, instead of loading the whole array, avoids the performance issue.
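The workaround snippet did not survive the page scrape, so here is a minimal sketch of the idea: take a pointer to the inner array first, so the subsequent index goes through that pointer instead of copying the whole `[10_000]u64` row. The test name and values are illustrative, not from the original comment.

```zig
const std = @import("std");

test "index through a pointer instead of copying the row" {
    var array: [10][10_000]u64 = undefined;
    array[3][7] = 42;

    // Take a pointer to the inner array; indexing through it touches
    // a single element rather than materializing the whole row.
    const row: *[10_000]u64 = &array[3];
    try std.testing.expectEqual(@as(u64, 42), row[7]);
}
```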
Reduction:

```zig
test {
    var array: [10][10_000]u64 = undefined;
    var array_ptr = &array;
    var a: usize = 1;
    _ = array_ptr[a][a];
}
```

The LLVM IR generated for this by stage2:

```llvm
define internal fastcc i16 @test3.test_0() unnamed_addr #0 {
Entry:
  %0 = alloca [10000 x i64], align 8
  ; ... omitted for brevity
Block:                                            ; preds = %Then
  %7 = getelementptr inbounds [10 x [10000 x i64]], ptr %4, i32 0, i64 %5
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %0, ptr align 8 %7, i64 80000, i1 false)
  %8 = load i64, ptr %1, align 8
  %9 = icmp ult i64 %8, 10000
  br i1 %9, label %Then1, label %Else2
Then1:                                            ; preds = %Block
  br label %Block3
Else2:                                            ; preds = %Block
  call fastcc void @builtin.panicOutOfBounds(i64 %8, i64 10000)
  unreachable
Block3:                                           ; preds = %Then1
  ret i16 0
}
```

Notice the intermediate load/memcpy of the inner array element, which is very slow. The load indeed seems to come from the code inserted for the in-bounds safety check. In Release mode, LLVM has no problem doing the optimization for us.
The problem boils down to this optimization in the LLVM backend:

```zig
// It would be valid to fall back to the code below here that simply calls
// load(). However, as an optimization, we want to avoid unnecessary copies
// of isByRef=true types. Here, we scan forward in the current block,
// looking to see if this load dies before any side effects occur.
// In such case, we can safely return the operand without making a copy.
for (body[body_i..]) |body_inst| {
    switch (self.liveness.categorizeOperand(self.air, body_inst, inst)) {
        .none => continue,
        .write, .noret, .complex => break :elide,
        .tomb => return try self.resolveInst(ty_op.operand),
    }
} else unreachable;
```

The bounds check inserts a `cond_br` with a panicking branch between the load and its last use, which this scan categorizes as a side effect, so the copy is never elided.
Note that the issue starts in AstGen, where array and field access force the lhs to be loaded even though the operation could just as well be done on a pointer.
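For illustration only (a sketch of the semantics, not of the compiler internals), the difference between loading the lhs and operating on a pointer can be expressed in user code like this; the names and values are made up:

```zig
const std = @import("std");

test "loading the lhs vs. indexing through a pointer" {
    var array: [10][10_000]u64 = undefined;
    array[1][2] = 7;
    var a: usize = 1;
    var b: usize = 2;

    // Loading the lhs as a value copies the entire [10_000]u64 row
    // before the second index is applied.
    const by_value: [10_000]u64 = array[a];
    try std.testing.expectEqual(@as(u64, 7), by_value[b]);

    // Keeping the lhs as a pointer and indexing through it touches
    // only the one element that is actually needed.
    const by_ptr: *[10_000]u64 = &array[a];
    try std.testing.expectEqual(@as(u64, 7), by_ptr[b]);
}
```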
This change adds to Liveness a simple pattern match for the try-like `.condbr` blocks emitted by Sema's safety checks. This allows us to determine that these do not modify memory, which permits us to elide additional loads in the backend. As @Vexu points out in the main issue, this is probably not a complete solution on its own. We'll still want a way to reliably narrow the load/copy when performing several consecutive accesses, such as `foo.arr[x][y].z` Resolves ziglang#12215
Followup to [25d3713]. Resolves ziglang#12215. Previous code didn't account for the extra unreach() that now exists in the AIR:

```
%29!= block(void, {
  %30!= cond_br(%22!, {
    %31!= br(%29, @Air.Inst.Ref.void_value)
  }, {
    %2!
    %15!
    %27!= call(%26, [%19!, %21])
    %28!= unreach()
  })
} %22!)
```
Zig Version
0.10.0-dev.3286+9ec3f611c
Steps to Reproduce
Expected Behavior
test passes in < 1s (and does with stage 1)
Actual Behavior
test passes in > 20s (stage 2) in Debug and ReleaseSafe modes.
Seems to work fine in ReleaseFast (there's no memcpy in the LLVM IR).