Description
After commit 47d831f (#100514) I noticed a regression in a downstream benchmark, caused by a loop no longer being vectorized. It seems that the changed cost of "llvm.umin" (from 2 to 1, given that the operation is legal for the target) affected SimplifyCFG so that it now simplifies the control flow by speculating the umin and combining it with a select instruction. Unfortunately, the new code emitted by SimplifyCFG is for some reason not recognized by the loop vectorizer.
Here is an IR example, cfg.ll, that shows what happens:
define i32 @foo(ptr %0, i32 %1) {
br label %5
3: ; preds = %14
%4 = phi i32 [ %15, %14 ]
ret i32 %4
5: ; preds = %2, %14
%6 = phi i64 [ 0, %2 ], [ %16, %14 ]
%7 = phi i32 [ 128, %2 ], [ %15, %14 ]
%8 = getelementptr inbounds i32, ptr %0, i64 %6
%9 = load i32, ptr %8, align 4
%10 = icmp sgt i32 %9, %1
br i1 %10, label %11, label %14
11: ; preds = %5
%12 = trunc nuw nsw i64 %6 to i32
%13 = tail call i32 @llvm.umin.i32(i32 %7, i32 %12)
br label %14
14: ; preds = %5, %11
%15 = phi i32 [ %13, %11 ], [ %7, %5 ]
%16 = add nuw nsw i64 %6, 1
%17 = icmp eq i64 %16, 128
br i1 %17, label %3, label %5
}
When running only the vectorizer, the loop is vectorized:
> opt -mtriple x86_64 -passes='loop-vectorize' cfg.ll -S -o - | grep umin
%11 = call <4 x i32> @llvm.umin.v4i32(<4 x i32> %vec.phi, <4 x i32> %vec.ind)
%12 = call <4 x i32> @llvm.umin.v4i32(<4 x i32> %vec.phi1, <4 x i32> %step.add)
%rdx.minmax = call <4 x i32> @llvm.umin.v4i32(<4 x i32> %predphi, <4 x i32> %predphi4)
%16 = call i32 @llvm.vector.reduce.umin.v4i32(<4 x i32> %rdx.minmax)
%27 = tail call i32 @llvm.umin.i32(i32 %21, i32 %26)
But when simplifycfg runs first, there is no vectorization:
> build-all/bin/opt -mtriple x86_64 -passes='simplifycfg,loop-vectorize' cfg.ll -S -o - | grep umin
%12 = tail call i32 @llvm.umin.i32(i32 %7, i32 %11)
The transform done by simplifycfg results in this IR:
define i32 @foo(ptr %0, i32 %1) {
br label %5
3: ; preds = %5
%4 = phi i32 [ %13, %5 ]
ret i32 %4
5: ; preds = %5, %2
%6 = phi i64 [ 0, %2 ], [ %14, %5 ]
%7 = phi i32 [ 128, %2 ], [ %13, %5 ]
%8 = getelementptr inbounds i32, ptr %0, i64 %6
%9 = load i32, ptr %8, align 4
%10 = icmp sgt i32 %9, %1
%11 = trunc nuw nsw i64 %6 to i32
%12 = tail call i32 @llvm.umin.i32(i32 %7, i32 %11)
%13 = select i1 %10, i32 %12, i32 %7
%14 = add nuw nsw i64 %6, 1
%15 = icmp eq i64 %14, 128
br i1 %15, label %3, label %5
}
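For reference, here is a rough scalar model of the two loop shapes, written in Python only to make the intended equivalence explicit. The function names `foo_branchy` and `foo_select` are made up for this sketch, and Python integers stand in for the i32 values (all values involved are in the range 0..128, so `min` behaves like the unsigned umin here). It is a sketch of what the IR appears to compute, not a statement about how either pass reasons about it.

```python
def foo_branchy(a, t):
    """Models the original IR: umin only executes in the guarded block."""
    r = 128                      # reduction phi starts at 128
    for i in range(128):         # induction variable 0..127
        if a[i] > t:             # icmp sgt + conditional branch
            r = min(r, i)        # llvm.umin.i32 in the conditional block
    return r


def foo_select(a, t):
    """Models the IR after simplifycfg: umin is speculated, select picks."""
    r = 128
    for i in range(128):
        m = min(r, i)                # llvm.umin.i32 executed unconditionally
        r = m if a[i] > t else r     # select on the compare result
    return r
```

Both functions return the smallest index whose element exceeds `t`, or 128 if there is none, so the simplifycfg output still describes the same conditional umin reduction.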
And with -debug, loop-vectorize complains like this:
LV: Checking a loop in 'foo' from cfg.ll
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop:
LV: Found an induction variable.
LV: PHI is not a poly recurrence.
LV: PHI is not a poly recurrence.
LV: Not vectorizing: Found an unidentified PHI %7 = phi i32 [ 128, %2 ], [ %13, %5 ]
LV: Interleaving disabled by the pass manager
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.
Is this some kind of limitation/bug in loop-vectorize? Or is it a phase ordering problem?