Reapply "[AArch64][SVE] Improve fixed-length addressing modes. (#130263)" #130625
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Changes

This restores commit f01e760.

The original patch from #129732 exposed what seems to be a bug in `SelectAddrModeIndexedSVE`.

Currently, the offset returned by `SelectAddrModeIndexedSVE` is computed by dividing a VL-based offset (`MulImm`) by the known minimum width of `MemVT`. This works when `MemVT` is a scalable vector type because scalable types are intrinsically VL-based. However, for fixed vector types, `MemVT` is not scaled to the SVE vector length, which may lead to inaccurate results. For example, for `vscale * 32`, I expect the offset returned to be `2*VL`, irrespective of the width of `MemVT` (unless the latter is an unpacked SVE type). VLA codegen agrees with this. However, for `<8 x i32>` vectors, VLS codegen (which uses `SelectAddrModeIndexedSVE`) returns `1*VL`: https://godbolt.org/z/7149fejGo. Is this intentional?

Although this seems to affect both VSCALE-based and Constant-based offsets, I believe we didn't come across it earlier because we don't generate combinations of VSCALE offsets + fixed vectors often. Enabling the Constant-based path made the problem (assuming it is a problem) obvious because combinations of Constant offsets + fixed vectors are more common.

To work around the issue temporarily, I added an early exit to the Constant-based path for fixed vector types. This doesn't affect the VSCALE path because I wanted to confirm whether the current behaviour is intentional or not.

I think the long-term solution is to set `MemWidthBytes = 16` for fixed vectors, which should fix the address calculation for both paths. I'm happy to do this here or open a separate PR, but first I wanted to confirm whether this is a viable solution (hence why I added a more conservative solution for the time being).

What do you think?

Full diff: https://github.com/llvm/llvm-project/pull/130625.diff

4 Files Affected:

- (modified) clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c (+3-6)
- (modified) llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp (+14-2)
- (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+11-1)
- (added) llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll (+472)
diff --git a/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c b/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c
index 0ed14b4b3b793..1391a1b09fbd1 100644
--- a/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c
+++ b/clang/test/CodeGen/AArch64/sve-vector-bits-codegen.c
@@ -13,12 +13,9 @@
void func(int *restrict a, int *restrict b) {
// CHECK-LABEL: func
-// CHECK256-COUNT-1: str
-// CHECK256-COUNT-7: st1w
-// CHECK512-COUNT-1: str
-// CHECK512-COUNT-3: st1w
-// CHECK1024-COUNT-1: str
-// CHECK1024-COUNT-1: st1w
+// CHECK256-COUNT-8: str
+// CHECK512-COUNT-4: str
+// CHECK1024-COUNT-2: str
// CHECK2048-COUNT-1: st1w
#pragma clang loop vectorize(enable)
for (int i = 0; i < 64; ++i)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
index 3ca9107cb2ce5..d338c22267885 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
@@ -7380,12 +7380,24 @@ bool AArch64DAGToDAGISel::SelectAddrModeIndexedSVE(SDNode *Root, SDValue N,
return false;
SDValue VScale = N.getOperand(1);
- if (VScale.getOpcode() != ISD::VSCALE)
+ int64_t MulImm = std::numeric_limits<int64_t>::max();
+ if (VScale.getOpcode() == ISD::VSCALE) {
+ MulImm = cast<ConstantSDNode>(VScale.getOperand(0))->getSExtValue();
+ } else if (auto C = dyn_cast<ConstantSDNode>(VScale)) {
+ int64_t ByteOffset = C->getSExtValue();
+ const auto KnownVScale =
+ Subtarget->getSVEVectorSizeInBits() / AArch64::SVEBitsPerBlock;
+
+ if (!KnownVScale || ByteOffset % KnownVScale != 0 ||
+ !MemVT.isScalableVector())
+ return false;
+
+ MulImm = ByteOffset / KnownVScale;
+ } else
return false;
TypeSize TS = MemVT.getSizeInBits();
int64_t MemWidthBytes = static_cast<int64_t>(TS.getKnownMinValue()) / 8;
- int64_t MulImm = cast<ConstantSDNode>(VScale.getOperand(0))->getSExtValue();
if ((MulImm % MemWidthBytes) != 0)
return false;
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index c6eb77e3bc3ba..f5ffc72cae537 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -391,7 +391,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
void mirFileLoaded(MachineFunction &MF) const override;
// Return the known range for the bit length of SVE data registers. A value
- // of 0 means nothing is known about that particular limit beyong what's
+ // of 0 means nothing is known about that particular limit beyond what's
// implied by the architecture.
unsigned getMaxSVEVectorSizeInBits() const {
assert(isSVEorStreamingSVEAvailable() &&
@@ -405,6 +405,16 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
return MinSVEVectorSizeInBits;
}
+ // Return the known bit length of SVE data registers. A value of 0 means the
+ // length is unknown beyond what's implied by the architecture.
+ unsigned getSVEVectorSizeInBits() const {
+ assert(isSVEorStreamingSVEAvailable() &&
+ "Tried to get SVE vector length without SVE support!");
+ if (MinSVEVectorSizeInBits == MaxSVEVectorSizeInBits)
+ return MaxSVEVectorSizeInBits;
+ return 0;
+ }
+
bool useSVEForFixedLengthVectors() const {
if (!isSVEorStreamingSVEAvailable())
return false;
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll
new file mode 100644
index 0000000000000..84ab5493b03ee
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-offsets.ll
@@ -0,0 +1,472 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=128 -aarch64-sve-vector-bits-max=128 < %s | FileCheck %s --check-prefix=CHECK-128
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=256 -aarch64-sve-vector-bits-max=256 < %s | FileCheck %s --check-prefix=CHECK-256
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=512 -aarch64-sve-vector-bits-max=512 < %s | FileCheck %s --check-prefix=CHECK-512
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=1024 -aarch64-sve-vector-bits-max=1024 < %s | FileCheck %s --check-prefix=CHECK-1024
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=2048 -aarch64-sve-vector-bits-max=2048 < %s | FileCheck %s --check-prefix=CHECK-2048
+
+define void @nxv16i8(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov w8, #256 // =0x100
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
+; CHECK-NEXT: st1b { z0.b }, p0, [x1, x8]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv16i8:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT: str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv16i8:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT: str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv16i8:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT: str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv16i8:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT: str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv16i8:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT: str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i8, ptr %ldptr, i64 256
+ %stoff = getelementptr inbounds nuw i8, ptr %stptr, i64 256
+ %x = load <vscale x 16 x i8>, ptr %ldoff, align 1
+ store <vscale x 16 x i8> %x, ptr %stoff, align 1
+ ret void
+}
+
+define void @nxv8i16(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, #128 // =0x80
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
+; CHECK-NEXT: st1h { z0.h }, p0, [x1, x8, lsl #1]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv8i16:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT: str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv8i16:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT: str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv8i16:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT: str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv8i16:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT: str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv8i16:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT: str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i16, ptr %ldptr, i64 128
+ %stoff = getelementptr inbounds nuw i16, ptr %stptr, i64 128
+ %x = load <vscale x 8 x i16>, ptr %ldoff, align 2
+ store <vscale x 8 x i16> %x, ptr %stoff, align 2
+ ret void
+}
+
+define void @nxv4i32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, #64 // =0x40
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv4i32:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT: str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv4i32:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT: str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv4i32:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT: str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv4i32:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT: str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv4i32:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT: str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i32, ptr %ldptr, i64 64
+ %stoff = getelementptr inbounds nuw i32, ptr %stptr, i64 64
+ %x = load <vscale x 4 x i32>, ptr %ldoff, align 4
+ store <vscale x 4 x i32> %x, ptr %stoff, align 4
+ ret void
+}
+
+define void @nxv2i64(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, #32 // =0x20
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
+; CHECK-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv2i64:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ldr z0, [x0, #16, mul vl]
+; CHECK-128-NEXT: str z0, [x1, #16, mul vl]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv2i64:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ldr z0, [x0, #8, mul vl]
+; CHECK-256-NEXT: str z0, [x1, #8, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv2i64:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ldr z0, [x0, #4, mul vl]
+; CHECK-512-NEXT: str z0, [x1, #4, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv2i64:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ldr z0, [x0, #2, mul vl]
+; CHECK-1024-NEXT: str z0, [x1, #2, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv2i64:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ldr z0, [x0, #1, mul vl]
+; CHECK-2048-NEXT: str z0, [x1, #1, mul vl]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i64, ptr %ldptr, i64 32
+ %stoff = getelementptr inbounds nuw i64, ptr %stptr, i64 32
+ %x = load <vscale x 2 x i64>, ptr %ldoff, align 8
+ store <vscale x 2 x i64> %x, ptr %stoff, align 8
+ ret void
+}
+
+define void @nxv4i8(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov w8, #32 // =0x20
+; CHECK-NEXT: ld1b { z0.s }, p0/z, [x0, x8]
+; CHECK-NEXT: st1b { z0.s }, p0, [x1, x8]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv4i8:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ptrue p0.s
+; CHECK-128-NEXT: mov w8, #32 // =0x20
+; CHECK-128-NEXT: ld1b { z0.s }, p0/z, [x0, x8]
+; CHECK-128-NEXT: st1b { z0.s }, p0, [x1, x8]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv4i8:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ptrue p0.s
+; CHECK-256-NEXT: ld1b { z0.s }, p0/z, [x0, #4, mul vl]
+; CHECK-256-NEXT: st1b { z0.s }, p0, [x1, #4, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv4i8:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ptrue p0.s
+; CHECK-512-NEXT: ld1b { z0.s }, p0/z, [x0, #2, mul vl]
+; CHECK-512-NEXT: st1b { z0.s }, p0, [x1, #2, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv4i8:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ptrue p0.s
+; CHECK-1024-NEXT: ld1b { z0.s }, p0/z, [x0, #1, mul vl]
+; CHECK-1024-NEXT: st1b { z0.s }, p0, [x1, #1, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv4i8:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ptrue p0.s
+; CHECK-2048-NEXT: mov w8, #32 // =0x20
+; CHECK-2048-NEXT: ld1b { z0.s }, p0/z, [x0, x8]
+; CHECK-2048-NEXT: st1b { z0.s }, p0, [x1, x8]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i8, ptr %ldptr, i64 32
+ %stoff = getelementptr inbounds nuw i8, ptr %stptr, i64 32
+ %x = load <vscale x 4 x i8>, ptr %ldoff, align 1
+ store <vscale x 4 x i8> %x, ptr %stoff, align 1
+ ret void
+}
+
+define void @nxv2f32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv2f32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, #16 // =0x10
+; CHECK-NEXT: ld1w { z0.d }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT: st1w { z0.d }, p0, [x1, x8, lsl #2]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv2f32:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ptrue p0.d
+; CHECK-128-NEXT: mov x8, #16 // =0x10
+; CHECK-128-NEXT: ld1w { z0.d }, p0/z, [x0, x8, lsl #2]
+; CHECK-128-NEXT: st1w { z0.d }, p0, [x1, x8, lsl #2]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv2f32:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ptrue p0.d
+; CHECK-256-NEXT: ld1w { z0.d }, p0/z, [x0, #4, mul vl]
+; CHECK-256-NEXT: st1w { z0.d }, p0, [x1, #4, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv2f32:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ptrue p0.d
+; CHECK-512-NEXT: ld1w { z0.d }, p0/z, [x0, #2, mul vl]
+; CHECK-512-NEXT: st1w { z0.d }, p0, [x1, #2, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv2f32:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ptrue p0.d
+; CHECK-1024-NEXT: ld1w { z0.d }, p0/z, [x0, #1, mul vl]
+; CHECK-1024-NEXT: st1w { z0.d }, p0, [x1, #1, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv2f32:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ptrue p0.d
+; CHECK-2048-NEXT: mov x8, #16 // =0x10
+; CHECK-2048-NEXT: ld1w { z0.d }, p0/z, [x0, x8, lsl #2]
+; CHECK-2048-NEXT: st1w { z0.d }, p0, [x1, x8, lsl #2]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i8, ptr %ldptr, i64 64
+ %stoff = getelementptr inbounds nuw i8, ptr %stptr, i64 64
+ %x = load <vscale x 2 x float>, ptr %ldoff, align 4
+ store <vscale x 2 x float> %x, ptr %stoff, align 4
+ ret void
+}
+
+define void @nxv4f64(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4f64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, #16 // =0x10
+; CHECK-NEXT: add x9, x0, #128
+; CHECK-NEXT: ldr z1, [x9, #1, mul vl]
+; CHECK-NEXT: add x9, x1, #128
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
+; CHECK-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; CHECK-NEXT: str z1, [x9, #1, mul vl]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: nxv4f64:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: add x8, x0, #128
+; CHECK-128-NEXT: ldr z1, [x0, #8, mul vl]
+; CHECK-128-NEXT: ldr z0, [x8, #1, mul vl]
+; CHECK-128-NEXT: add x8, x1, #128
+; CHECK-128-NEXT: str z0, [x8, #1, mul vl]
+; CHECK-128-NEXT: str z1, [x1, #8, mul vl]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: nxv4f64:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: add x8, x0, #128
+; CHECK-256-NEXT: ldr z1, [x0, #4, mul vl]
+; CHECK-256-NEXT: ldr z0, [x8, #1, mul vl]
+; CHECK-256-NEXT: add x8, x1, #128
+; CHECK-256-NEXT: str z0, [x8, #1, mul vl]
+; CHECK-256-NEXT: str z1, [x1, #4, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: nxv4f64:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: add x8, x0, #128
+; CHECK-512-NEXT: ldr z1, [x0, #2, mul vl]
+; CHECK-512-NEXT: ldr z0, [x8, #1, mul vl]
+; CHECK-512-NEXT: add x8, x1, #128
+; CHECK-512-NEXT: str z0, [x8, #1, mul vl]
+; CHECK-512-NEXT: str z1, [x1, #2, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: nxv4f64:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: add x8, x0, #128
+; CHECK-1024-NEXT: ldr z1, [x0, #1, mul vl]
+; CHECK-1024-NEXT: ldr z0, [x8, #1, mul vl]
+; CHECK-1024-NEXT: add x8, x1, #128
+; CHECK-1024-NEXT: str z0, [x8, #1, mul vl]
+; CHECK-1024-NEXT: str z1, [x1, #1, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: nxv4f64:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ptrue p0.d
+; CHECK-2048-NEXT: mov x8, #16 // =0x10
+; CHECK-2048-NEXT: add x9, x0, #128
+; CHECK-2048-NEXT: ldr z1, [x9, #1, mul vl]
+; CHECK-2048-NEXT: add x9, x1, #128
+; CHECK-2048-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
+; CHECK-2048-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; CHECK-2048-NEXT: str z1, [x9, #1, mul vl]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i8, ptr %ldptr, i64 128
+ %stoff = getelementptr inbounds nuw i8, ptr %stptr, i64 128
+ %x = load <vscale x 4 x double>, ptr %ldoff, align 8
+ store <vscale x 4 x double> %x, ptr %stoff, align 8
+ ret void
+}
+
+define void @v8i32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: v8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ldp q0, q1, [x0, #64]
+; CHECK-NEXT: ldp q3, q2, [x0, #32]
+; CHECK-NEXT: stp q0, q1, [x1, #64]
+; CHECK-NEXT: stp q3, q2, [x1, #32]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: v8i32:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: ldp q0, q1, [x0, #64]
+; CHECK-128-NEXT: ldp q3, q2, [x0, #32]
+; CHECK-128-NEXT: stp q0, q1, [x1, #64]
+; CHECK-128-NEXT: stp q3, q2, [x1, #32]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: v8i32:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: ptrue p0.s
+; CHECK-256-NEXT: mov x8, #16 // =0x10
+; CHECK-256-NEXT: mov x9, #8 // =0x8
+; CHECK-256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-256-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
+; CHECK-256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; CHECK-256-NEXT: st1w { z1.s }, p0, [x1, x9, lsl #2]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: v8i32:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: ptrue p0.s
+; CHECK-512-NEXT: mov x8, #8 // =0x8
+; CHECK-512-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-512-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: v8i32:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: ptrue p0.s, vl16
+; CHECK-1024-NEXT: mov x8, #8 // =0x8
+; CHECK-1024-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-1024-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: v8i32:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: ptrue p0.s, vl16
+; CHECK-2048-NEXT: mov x8, #8 // =0x8
+; CHECK-2048-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-2048-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; CHECK-2048-NEXT: ret
+ %ldoff = getelementptr inbounds nuw i8, ptr %ldptr, i64 32
+ %stoff = getelementptr inbounds nuw i8, ptr %stptr, i64 32
+ %x = load <16 x i32>, ptr %ldoff, align 4
+ store <16 x i32> %x, ptr %stoff, align 4
+ ret void
+}
+
+; FIXME: This is wrong for VLS.
+define void @v8i32_vscale(ptr %0) {
+; CHECK-LABEL: v8i32_vscale:
+; CHECK: // %bb.0:
+; CHECK-NEXT: movi v0.4s, #1
+; CHECK-NEXT: rdvl x8, #2
+; CHECK-NEXT: add x8, x0, x8
+; CHECK-NEXT: stp q0, q0, [x8]
+; CHECK-NEXT: ret
+;
+; CHECK-128-LABEL: v8i32_vscale:
+; CHECK-128: // %bb.0:
+; CHECK-128-NEXT: movi v0.4s, #1
+; CHECK-128-NEXT: rdvl x8, #2
+; CHECK-128-NEXT: add x8, x0, x8
+; CHECK-128-NEXT: stp q0, q0, [x8]
+; CHECK-128-NEXT: ret
+;
+; CHECK-256-LABEL: v8i32_vscale:
+; CHECK-256: // %bb.0:
+; CHECK-256-NEXT: mov z0.s, #1 // =0x1
+; CHECK-256-NEXT: ptrue p0.s
+; CHECK-256-NEXT: st1w { z0.s }, p0, [x0, #1, mul vl]
+; CHECK-256-NEXT: ret
+;
+; CHECK-512-LABEL: v8i32_vscale:
+; CHECK-512: // %bb.0:
+; CHECK-512-NEXT: mov z0.s, #1 // =0x1
+; CHECK-512-NEXT: ptrue p0.s, vl8
+; CHECK-512-NEXT: st1w { z0.s }, p0, [x0, #1, mul vl]
+; CHECK-512-NEXT: ret
+;
+; CHECK-1024-LABEL: v8i32_vscale:
+; CHECK-1024: // %bb.0:
+; CHECK-1024-NEXT: mov z0.s, #1 // =0x1
+; CHECK-1024-NEXT: ptrue p0.s, vl8
+; CHECK-1024-NEXT: st1w { z0.s }, p0, [x0, #1, mul vl]
+; CHECK-1024-NEXT: ret
+;
+; CHECK-2048-LABEL: v8i32_vscale:
+; CHECK-2048: // %bb.0:
+; CHECK-2048-NEXT: mov z0.s, #1 // =0x1
+; CHECK-2048-NEXT: ptrue p0.s, vl8
+; CHECK-2048-NEXT: st1w { z0.s }, p0, [x0, #1, mul vl]
+; CHECK-2048-NEXT: ret
+ %vl = call i64 @llvm.vscale()
+ %vlx = shl i64 %vl, 5
+ %2 = getelementptr inbounds nuw i8, ptr %0, i64 %vlx
+ store <8 x i32> splat (i32 1), ptr %2, align 4
+ ret void
+}
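As a concrete reading of the new Constant-based path above, here is a minimal standalone sketch of the folding it performs. This is illustrative only: `vlScaledImm` and its simplified signature are invented for the example, and the real selector additionally range-checks the resulting immediate.

```cpp
#include <cstdint>
#include <optional>

constexpr int64_t SVEBitsPerBlock = 128; // one VL granule is 128 bits

// SVEVectorSizeInBits is 0 when the exact length is unknown, mirroring the
// new AArch64Subtarget::getSVEVectorSizeInBits helper.
std::optional<int64_t> vlScaledImm(int64_t ByteOffset,
                                   unsigned SVEVectorSizeInBits,
                                   int64_t MemWidthBytes) {
  int64_t KnownVScale = SVEVectorSizeInBits / SVEBitsPerBlock;
  if (KnownVScale == 0 || ByteOffset % KnownVScale != 0)
    return std::nullopt;         // length unknown, or offset not divisible
  int64_t MulImm = ByteOffset / KnownVScale; // byte offset per vscale unit
  if (MulImm % MemWidthBytes != 0)
    return std::nullopt;         // not a whole number of vector accesses
  return MulImm / MemWidthBytes; // the "#N, mul vl" immediate
}
```

For example, a 256-byte offset to an `nxv16i8` access (`MemWidthBytes = 16`) on a 256-bit machine (`KnownVScale = 2`) gives 256 / 2 / 16 = 8, matching the `ldr z0, [x0, #8, mul vl]` in the CHECK-256 lines above.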
Sorry for the delay and thanks for the investigation @rj-jesus. This is sooooo not intentional behaviour. VLS-based auto-vectorisation was implemented before the VLS ACLE extensions, and by that time it's likely that fixed-length calls to `SelectAddrModeIndexedSVE` simply didn't exist. I've quickly tested a change that corrects the behaviour for your test case, but I've not investigated what other nodes need to be handled so as not to hit the unreachable.
Do you mind running with such an approach for your PR? If not, I'm happy to finish it off and push a PR for yours to build upon.
Thank you very much for the explanation, @paulwalker-arm - that makes a lot of sense! I'll try your suggestion tomorrow. I'll let you know how it goes. :)
Hi @paulwalker-arm, thanks again for your suggestion. I think it works well; as far as I could tell, only one node was missing from the handling. Please let me know if you have any other comments!
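To make the suspected bug concrete, here is a minimal sketch of the offset arithmetic questioned in the description (illustrative names only, not the in-tree code), showing why dividing by a fixed vector's width skews the result:

```cpp
#include <cstdint>
#include <cstdio>

// Current behaviour: divide the VL-based byte offset (MulImm) by the
// minimum width in bytes of the accessed type (MemWidthBytes).
int64_t vlUnits(int64_t MulImm, int64_t MemWidthBytes) {
  return MulImm / MemWidthBytes;
}

int main() {
  // An offset of vscale * 32 bytes is 2 * VL, since one VL granule is
  // 16 bytes (128 bits).
  int64_t MulImm = 32;
  // Scalable access <vscale x 4 x i32>: min width 16 bytes -> 2 * VL. OK.
  printf("scalable: %lld * VL\n", (long long)vlUnits(MulImm, 16));
  // Fixed access <8 x i32>: width 32 bytes -> 1 * VL, but a fixed width
  // does not scale with vscale, so this is half the intended offset.
  // Using MemWidthBytes = 16 for fixed vectors, as proposed in the
  // description, would give the expected 2 * VL.
  printf("fixed:    %lld * VL\n", (long long)vlUnits(MulImm, 32));
}
```

This is exactly the discrepancy exercised by the FIXME'd `v8i32_vscale` test in the diff above.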