[RISCV] Add initial support of memcmp expansion #107548
Conversation
Created using spr 1.3.6-beta.1
@llvm/pr-subscribers-backend-risc-v

Author: Pengcheng Wang (wangpc-pp)

Changes: There are two passes that depend on the implementation of TargetTransformInfo::enableMemCmpExpansion: MergeICmps and ExpandMemCmp. This PR adds the initial implementation of enableMemCmpExpansion so that we can get some basic benefits from these two passes. The vector case hasn't been tested, as we don't currently generate inlined vector instructions for memcmp.

Patch is 42.75 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/107548.diff

3 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index e809e15eacf696..ad532aadc83266 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -2113,3 +2113,18 @@ bool RISCVTTIImpl::shouldConsiderAddressTypePromotion(
}
return Considerable;
}
+
+RISCVTTIImpl::TTI::MemCmpExpansionOptions
+RISCVTTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
+ TTI::MemCmpExpansionOptions Options;
+ // FIXME: Vector cases haven't been tested.
+ Options.AllowOverlappingLoads =
+ (ST->enableUnalignedScalarMem() || ST->enableUnalignedVectorMem());
+ Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize);
+ Options.NumLoadsPerBlock = Options.MaxNumLoads;
+ if (ST->is64Bit())
+ Options.LoadSizes.push_back(8);
+ llvm::append_range(Options.LoadSizes, ArrayRef({4, 2, 1}));
+ Options.AllowedTailExpansions = {3, 5, 6};
+ return Options;
+}
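Semantically, the overlapping-loads expansion these options enable can be sketched in plain C++. The sketch below is a hypothetical model (not the pass's actual output) of what gets emitted for a 15-byte bcmp on RV64 with unaligned access: two 8-byte loads per buffer at offsets 0 and 7, the second overlapping the first by one byte, matching the CHECK-UNALIGNED-RV64 output in the test file below.

```cpp
#include <cstdint>
#include <cstring>

// Unaligned-safe 8-byte load; compiles to a single ld when
// unaligned scalar memory access is legal.
static uint64_t load64(const void *p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Model of the expanded bcmp(s1, s2, 15) with AllowOverlappingLoads:
// loads cover bytes [0, 8) and [7, 15), XOR per pair, OR, test for zero.
bool bcmp15_is_equal(const char *s1, const char *s2) {
  uint64_t a = load64(s1) ^ load64(s2);          // bytes [0, 8)
  uint64_t b = load64(s1 + 7) ^ load64(s2 + 7);  // bytes [7, 15)
  return (a | b) == 0;                           // the final seqz
}
```

Without overlapping loads, the 15-byte tail would instead need 8+4+2+1 byte loads; the overlap trick covers it with two loads.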
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 763b89bfec0a66..ee9bed09df97f3 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -404,6 +404,9 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
shouldConsiderAddressTypePromotion(const Instruction &I,
bool &AllowPromotionWithoutCommonHeader);
std::optional<unsigned> getMinPageSize() const { return 4096; }
+
+ TTI::MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
+ bool IsZeroCmp) const;
};
} // end namespace llvm
diff --git a/llvm/test/CodeGen/RISCV/memcmp.ll b/llvm/test/CodeGen/RISCV/memcmp.ll
new file mode 100644
index 00000000000000..652cd02e2c750a
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/memcmp.ll
@@ -0,0 +1,932 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: sed 's/iXLen/i32/g' %s | llc -mtriple=riscv32 -O2 | FileCheck %s --check-prefix=CHECK-ALIGNED-RV32
+; RUN: sed 's/iXLen/i64/g' %s | llc -mtriple=riscv64 -O2 | FileCheck %s --check-prefix=CHECK-ALIGNED-RV64
+; RUN: sed 's/iXLen/i32/g' %s | llc -mtriple=riscv32 -mattr=+unaligned-scalar-mem -O2 \
+; RUN: | FileCheck %s --check-prefix=CHECK-UNALIGNED-RV32
+; RUN: sed 's/iXLen/i64/g' %s | llc -mtriple=riscv64 -mattr=+unaligned-scalar-mem -O2 \
+; RUN: | FileCheck %s --check-prefix=CHECK-UNALIGNED-RV64
+
+declare i32 @bcmp(i8*, i8*, iXLen) nounwind readonly
+declare i32 @memcmp(i8*, i8*, iXLen) nounwind readonly
+
+define i1 @bcmp_size_15(i8* %s1, i8* %s2) {
+; CHECK-ALIGNED-RV32-LABEL: bcmp_size_15:
+; CHECK-ALIGNED-RV32: # %bb.0: # %entry
+; CHECK-ALIGNED-RV32-NEXT: lbu a2, 1(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a3, 0(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a4, 2(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a5, 3(a0)
+; CHECK-ALIGNED-RV32-NEXT: slli a2, a2, 8
+; CHECK-ALIGNED-RV32-NEXT: or a2, a2, a3
+; CHECK-ALIGNED-RV32-NEXT: slli a4, a4, 16
+; CHECK-ALIGNED-RV32-NEXT: slli a5, a5, 24
+; CHECK-ALIGNED-RV32-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV32-NEXT: or a2, a4, a2
+; CHECK-ALIGNED-RV32-NEXT: lbu a3, 1(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a4, 0(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a5, 2(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a6, 3(a1)
+; CHECK-ALIGNED-RV32-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV32-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV32-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV32-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV32-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV32-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV32-NEXT: xor a2, a2, a3
+; CHECK-ALIGNED-RV32-NEXT: lbu a3, 5(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a4, 4(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a5, 6(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a6, 7(a0)
+; CHECK-ALIGNED-RV32-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV32-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV32-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV32-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV32-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV32-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV32-NEXT: lbu a4, 5(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a5, 4(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a6, 6(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a7, 7(a1)
+; CHECK-ALIGNED-RV32-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV32-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV32-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV32-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV32-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV32-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV32-NEXT: xor a3, a3, a4
+; CHECK-ALIGNED-RV32-NEXT: lbu a4, 9(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a5, 8(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a6, 10(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a7, 11(a0)
+; CHECK-ALIGNED-RV32-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV32-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV32-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV32-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV32-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV32-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV32-NEXT: lbu a5, 9(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a6, 8(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a7, 10(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu t0, 11(a1)
+; CHECK-ALIGNED-RV32-NEXT: slli a5, a5, 8
+; CHECK-ALIGNED-RV32-NEXT: or a5, a5, a6
+; CHECK-ALIGNED-RV32-NEXT: slli a7, a7, 16
+; CHECK-ALIGNED-RV32-NEXT: slli t0, t0, 24
+; CHECK-ALIGNED-RV32-NEXT: or a6, t0, a7
+; CHECK-ALIGNED-RV32-NEXT: lbu a7, 13(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu t0, 12(a0)
+; CHECK-ALIGNED-RV32-NEXT: or a5, a6, a5
+; CHECK-ALIGNED-RV32-NEXT: xor a4, a4, a5
+; CHECK-ALIGNED-RV32-NEXT: slli a7, a7, 8
+; CHECK-ALIGNED-RV32-NEXT: or a5, a7, t0
+; CHECK-ALIGNED-RV32-NEXT: lbu a6, 13(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a7, 12(a1)
+; CHECK-ALIGNED-RV32-NEXT: lbu a0, 14(a0)
+; CHECK-ALIGNED-RV32-NEXT: lbu a1, 14(a1)
+; CHECK-ALIGNED-RV32-NEXT: slli a6, a6, 8
+; CHECK-ALIGNED-RV32-NEXT: or a6, a6, a7
+; CHECK-ALIGNED-RV32-NEXT: xor a5, a5, a6
+; CHECK-ALIGNED-RV32-NEXT: xor a0, a0, a1
+; CHECK-ALIGNED-RV32-NEXT: or a2, a2, a3
+; CHECK-ALIGNED-RV32-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV32-NEXT: or a2, a2, a4
+; CHECK-ALIGNED-RV32-NEXT: or a0, a2, a0
+; CHECK-ALIGNED-RV32-NEXT: seqz a0, a0
+; CHECK-ALIGNED-RV32-NEXT: ret
+;
+; CHECK-ALIGNED-RV64-LABEL: bcmp_size_15:
+; CHECK-ALIGNED-RV64: # %bb.0: # %entry
+; CHECK-ALIGNED-RV64-NEXT: lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 0(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 3(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a2, a2, 8
+; CHECK-ALIGNED-RV64-NEXT: or a2, a2, a3
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: or a2, a4, a2
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 5(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 4(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 6(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 7(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV64-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 32
+; CHECK-ALIGNED-RV64-NEXT: or a2, a3, a2
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 1(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 0(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 2(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 3(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV64-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 5(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 4(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 6(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 7(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV64-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 32
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: xor a2, a2, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 9(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 8(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 10(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 11(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV64-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 9(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 8(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 10(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 11(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV64-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 13(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 12(a0)
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: xor a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a7
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 13(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 12(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a0, 14(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a1, 14(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 8
+; CHECK-ALIGNED-RV64-NEXT: or a5, a5, a6
+; CHECK-ALIGNED-RV64-NEXT: xor a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: xor a0, a0, a1
+; CHECK-ALIGNED-RV64-NEXT: or a0, a4, a0
+; CHECK-ALIGNED-RV64-NEXT: or a0, a3, a0
+; CHECK-ALIGNED-RV64-NEXT: or a0, a2, a0
+; CHECK-ALIGNED-RV64-NEXT: seqz a0, a0
+; CHECK-ALIGNED-RV64-NEXT: ret
+;
+; CHECK-UNALIGNED-RV32-LABEL: bcmp_size_15:
+; CHECK-UNALIGNED-RV32: # %bb.0: # %entry
+; CHECK-UNALIGNED-RV32-NEXT: lw a2, 0(a0)
+; CHECK-UNALIGNED-RV32-NEXT: lw a3, 0(a1)
+; CHECK-UNALIGNED-RV32-NEXT: lw a4, 4(a0)
+; CHECK-UNALIGNED-RV32-NEXT: lw a5, 4(a1)
+; CHECK-UNALIGNED-RV32-NEXT: lw a6, 8(a0)
+; CHECK-UNALIGNED-RV32-NEXT: lw a7, 8(a1)
+; CHECK-UNALIGNED-RV32-NEXT: lw a0, 11(a0)
+; CHECK-UNALIGNED-RV32-NEXT: lw a1, 11(a1)
+; CHECK-UNALIGNED-RV32-NEXT: xor a2, a2, a3
+; CHECK-UNALIGNED-RV32-NEXT: xor a4, a4, a5
+; CHECK-UNALIGNED-RV32-NEXT: xor a3, a6, a7
+; CHECK-UNALIGNED-RV32-NEXT: xor a0, a0, a1
+; CHECK-UNALIGNED-RV32-NEXT: or a2, a2, a4
+; CHECK-UNALIGNED-RV32-NEXT: or a0, a3, a0
+; CHECK-UNALIGNED-RV32-NEXT: or a0, a2, a0
+; CHECK-UNALIGNED-RV32-NEXT: seqz a0, a0
+; CHECK-UNALIGNED-RV32-NEXT: ret
+;
+; CHECK-UNALIGNED-RV64-LABEL: bcmp_size_15:
+; CHECK-UNALIGNED-RV64: # %bb.0: # %entry
+; CHECK-UNALIGNED-RV64-NEXT: ld a2, 0(a0)
+; CHECK-UNALIGNED-RV64-NEXT: ld a3, 0(a1)
+; CHECK-UNALIGNED-RV64-NEXT: ld a0, 7(a0)
+; CHECK-UNALIGNED-RV64-NEXT: ld a1, 7(a1)
+; CHECK-UNALIGNED-RV64-NEXT: xor a2, a2, a3
+; CHECK-UNALIGNED-RV64-NEXT: xor a0, a0, a1
+; CHECK-UNALIGNED-RV64-NEXT: or a0, a2, a0
+; CHECK-UNALIGNED-RV64-NEXT: seqz a0, a0
+; CHECK-UNALIGNED-RV64-NEXT: ret
+entry:
+ %bcmp = call i32 @bcmp(i8* %s1, i8* %s2, iXLen 15)
+ %ret = icmp eq i32 %bcmp, 0
+ ret i1 %ret
+}
+
+define i1 @bcmp_size_31(i8* %s1, i8* %s2) {
+; CHECK-ALIGNED-RV32-LABEL: bcmp_size_31:
+; CHECK-ALIGNED-RV32: # %bb.0: # %entry
+; CHECK-ALIGNED-RV32-NEXT: addi sp, sp, -16
+; CHECK-ALIGNED-RV32-NEXT: .cfi_def_cfa_offset 16
+; CHECK-ALIGNED-RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
+; CHECK-ALIGNED-RV32-NEXT: .cfi_offset ra, -4
+; CHECK-ALIGNED-RV32-NEXT: li a2, 31
+; CHECK-ALIGNED-RV32-NEXT: call bcmp
+; CHECK-ALIGNED-RV32-NEXT: seqz a0, a0
+; CHECK-ALIGNED-RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
+; CHECK-ALIGNED-RV32-NEXT: addi sp, sp, 16
+; CHECK-ALIGNED-RV32-NEXT: ret
+;
+; CHECK-ALIGNED-RV64-LABEL: bcmp_size_31:
+; CHECK-ALIGNED-RV64: # %bb.0: # %entry
+; CHECK-ALIGNED-RV64-NEXT: lbu a2, 1(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 0(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 2(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 3(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a2, a2, 8
+; CHECK-ALIGNED-RV64-NEXT: or a2, a2, a3
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: or a2, a4, a2
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 5(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 4(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 6(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 7(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV64-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 32
+; CHECK-ALIGNED-RV64-NEXT: or a2, a3, a2
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 1(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 0(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 2(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 3(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV64-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 5(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 4(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 6(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 7(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV64-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 32
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: xor a2, a2, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a3, 9(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 8(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 10(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 11(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a3, a3, 8
+; CHECK-ALIGNED-RV64-NEXT: or a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 24
+; CHECK-ALIGNED-RV64-NEXT: or a4, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 13(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 12(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 14(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 15(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV64-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 32
+; CHECK-ALIGNED-RV64-NEXT: or a3, a4, a3
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 9(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 8(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 10(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 11(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV64-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 13(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 12(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 14(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 15(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 8
+; CHECK-ALIGNED-RV64-NEXT: or a5, a5, a6
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 16
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 24
+; CHECK-ALIGNED-RV64-NEXT: or a6, t0, a7
+; CHECK-ALIGNED-RV64-NEXT: or a5, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 32
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: xor a3, a3, a4
+; CHECK-ALIGNED-RV64-NEXT: lbu a4, 17(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 16(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 18(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 19(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a4, a4, 8
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 16
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 24
+; CHECK-ALIGNED-RV64-NEXT: or a5, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 21(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 20(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 22(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 23(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 8
+; CHECK-ALIGNED-RV64-NEXT: or a5, a5, a6
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 16
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 24
+; CHECK-ALIGNED-RV64-NEXT: or a6, t0, a7
+; CHECK-ALIGNED-RV64-NEXT: or a5, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 32
+; CHECK-ALIGNED-RV64-NEXT: or a4, a5, a4
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 17(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 16(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 18(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 19(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 8
+; CHECK-ALIGNED-RV64-NEXT: or a5, a5, a6
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 16
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 24
+; CHECK-ALIGNED-RV64-NEXT: or a6, t0, a7
+; CHECK-ALIGNED-RV64-NEXT: or a5, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 21(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 20(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 22(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t1, 23(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 8
+; CHECK-ALIGNED-RV64-NEXT: or a6, a6, a7
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 16
+; CHECK-ALIGNED-RV64-NEXT: slli t1, t1, 24
+; CHECK-ALIGNED-RV64-NEXT: or a7, t1, t0
+; CHECK-ALIGNED-RV64-NEXT: or a6, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 32
+; CHECK-ALIGNED-RV64-NEXT: or a5, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: xor a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: lbu a5, 25(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 24(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 26(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 27(a0)
+; CHECK-ALIGNED-RV64-NEXT: slli a5, a5, 8
+; CHECK-ALIGNED-RV64-NEXT: or a5, a5, a6
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 16
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 24
+; CHECK-ALIGNED-RV64-NEXT: or a6, t0, a7
+; CHECK-ALIGNED-RV64-NEXT: or a5, a6, a5
+; CHECK-ALIGNED-RV64-NEXT: lbu a6, 25(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 24(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 26(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t1, 27(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a6, a6, 8
+; CHECK-ALIGNED-RV64-NEXT: or a6, a6, a7
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 16
+; CHECK-ALIGNED-RV64-NEXT: slli t1, t1, 24
+; CHECK-ALIGNED-RV64-NEXT: or a7, t1, t0
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 29(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu t1, 28(a0)
+; CHECK-ALIGNED-RV64-NEXT: or a6, a7, a6
+; CHECK-ALIGNED-RV64-NEXT: xor a5, a5, a6
+; CHECK-ALIGNED-RV64-NEXT: slli t0, t0, 8
+; CHECK-ALIGNED-RV64-NEXT: or a6, t0, t1
+; CHECK-ALIGNED-RV64-NEXT: lbu a7, 29(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu t0, 28(a1)
+; CHECK-ALIGNED-RV64-NEXT: lbu a0, 30(a0)
+; CHECK-ALIGNED-RV64-NEXT: lbu a1, 30(a1)
+; CHECK-ALIGNED-RV64-NEXT: slli a7, a7, 8
+; CHECK-ALIGNED-RV64-NEXT: or a7, a7, t0
+; CHECK-ALIGNED-RV64-NEXT: xor a6, a6, a7
+; CHECK-ALIGNED-RV64-NEXT: xor a0, a0, a1
+; CHECK-ALIGNED-RV64-NEXT: or a2, a2, a3
+; CHECK-ALIGNED-RV64-NEXT: or a4, a4, a5
+; CHECK-ALIGNED-RV64-NEXT: or a0, a6, a0
+; CHECK-ALIGNED-RV64-NEXT: or a2, a2, a4
+; CHECK-ALIGNED-RV64-NEXT: ...
[truncated]
Looks like some SPEC benchmarks use memcmp; I'm benchmarking them with this patch now. I'll update you when it finishes :)
TTI::MemCmpExpansionOptions Options;
// FIXME: Vector cases haven't been tested.
Options.AllowOverlappingLoads =
    (ST->enableUnalignedScalarMem() || ST->enableUnalignedVectorMem());
Unless I'm missing something, there is no check for sufficient alignment on any load generated by ExpandMemCmp, as a result, we can only legally provide a non-byte size if unaligned access is fully supported.
(This is really a bug in ExpandMemCmp - it should be asking whether the load is properly aligned.)
This is perhaps more of a comment for #107824 than for this one, but I think we'd benefit from some level of test coverage for OptForSize to demonstrate that we stick with the libcall in cases where expanding increases code size.
Thanks! Done!
Cheers! It strikes me that we should probably be less aggressive about expanding memcmp for at least
I collected the stats on the number of memcmps that were inlined, it looks like we're able to expand a good chunk of them:
There's a small difference in the number of original memcmp calls; there are some merge commits in this branch which might have changed the codegen slightly in the meantime. I'm working on getting some runtime numbers now, sorry for the delay.
No need to say sorry, I really appreciate your help! 😃
The run just finished. I'm seeing a 0.75% improvement on 500.perlbench_r on the BPI F3 (-O3 -mcpu=spacemit-x60), and no regressions or improvements on the other benchmarks as far as I can see. That seems to check out with the number of memcmps inlined reported for perlbench!
Thanks a lot! The result is within my expectations.
; CHECK-ALIGNED-RV32-NEXT: call bcmp
; CHECK-ALIGNED-RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; CHECK-ALIGNED-RV32-NEXT: addi sp, sp, 16
; CHECK-ALIGNED-RV32-NEXT: lbu a2, 1(a0)
Would it be cheaper to XOR all the bytes individually and then OR the XOR results together?
It costs the same number of instructions?
I was proposing
lbu a2, 1(a0)
lbu a0, 0(a0)
lbu a3, 1(a1)
lbu a1, 0(a1)
xor a2, a2, a3
xor a0, a0, a1
or a2, a2, a0
snez a2, a2
that looks like fewer instructions.
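The two lowerings compute the same inequality predicate; the difference can be modeled in C++ (a sketch with illustrative names, shown here for a 2-byte region to keep it short):

```cpp
#include <cstdint>

// Current expansion for an aligned multi-byte compare: assemble a wide
// integer from byte loads on each side (shift + or per byte), then
// compare the assembled words.
bool ne2_via_word(const uint8_t *a, const uint8_t *b) {
  uint16_t wa = uint16_t(a[0]) | uint16_t(uint16_t(a[1]) << 8);
  uint16_t wb = uint16_t(b[0]) | uint16_t(uint16_t(b[1]) << 8);
  return wa != wb;
}

// Proposed form: XOR corresponding bytes directly and OR the results,
// skipping the shifts that assemble the wide values.
bool ne2_via_bytes(const uint8_t *a, const uint8_t *b) {
  return ((a[0] ^ b[0]) | (a[1] ^ b[1])) != 0;
}
```

The byte-wise form saves the shift instructions per side, which is where the instruction-count difference in the quoted assembly comes from.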
Great! This is an optimization we can do in ExpandMemcmp; I will take a look.
This seems to be more complicated than what I was thinking. For this case, we generate i16 loads/compares, and then we expand them since unaligned loads are illegal. We may expand memcmp to a byte loop directly when there is no unaligned access.
llvm/test/CodeGen/RISCV/memcmp.ll (Outdated)
; CHECK-UNALIGNED-RV32-NEXT: call memcmp
; CHECK-UNALIGNED-RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; CHECK-UNALIGNED-RV32-NEXT: addi sp, sp, 16
; CHECK-UNALIGNED-RV32-NEXT: lw a0, 0(a0)
What is this code doing? It seems way more complicated than a 4 byte memcmp should be. In fact it's longer than the ALIGNED version.
I guess this is an expanded byteswap?
Add a RUN line with Zbb/Zbkb?
Maybe we should restrict MemCmp expansion to only when Zbb/Zbkb are enabled?
Done.
It seems we can benefit not only from rev8 instructions but also from pack instructions (for forming large integers).
But I don't think we should disable the expansion when Zbb/Zbkb don't exist. If these extensions are not supported and we don't expand memcmp, we would still execute these instructions to do the same operations in glibc, right?
The library call will do a byte loop without Zbb I think? I don't think it will do the byte swap with shift/and/or.
So the question is, is doing a wide load and an emulated byteswap cheaper than the byte loop?
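The two candidates in that question can be sketched in C++ (illustrative names; the wide version assumes a little-endian host, as on RISC-V, so that byteswapping makes an unsigned word compare match lexicographic byte order):

```cpp
#include <cstdint>
#include <cstring>

// Candidate 1: a byte loop, as a libcall without Zbb would effectively do.
int memcmp4_byte_loop(const uint8_t *a, const uint8_t *b) {
  for (int i = 0; i < 4; ++i)
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  return 0;
}

// Emulated byteswap: without Zbb's rev8, this is the shift/and/or
// sequence the backend has to emit.
static uint32_t bswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0xFF00u) | ((v << 8) & 0xFF0000u) |
         (v << 24);
}

// Candidate 2: one wide load per side, byteswap so byte 0 becomes the
// most significant byte, then a single unsigned compare.
int memcmp4_wide(const uint8_t *a, const uint8_t *b) {
  uint32_t va, vb;
  std::memcpy(&va, a, 4);  // unaligned-safe 4-byte load
  std::memcpy(&vb, b, 4);
  va = bswap32(va);
  vb = bswap32(vb);
  return va < vb ? -1 : (va > vb ? 1 : 0);
}
```

The trade-off being debated: candidate 2 is branch-free but pays for two emulated byteswaps without Zbb, while candidate 1 has up to four data-dependent branches.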
Good question, I don't know the answer, but I think we can be aggressive here and enable it for all configurations. If we have some evidence that shows inefficiencies, we can come back and disable it. WDYT?
Can we test the non-Zbb config on qemu and get the instruction count?
Here is the code of memcmp copied from glibc: https://godbolt.org/z/4KxPTE6q1
There are many cases (which means branches) in this general implementation, at least we can benefit from unrolling and removal of branches.
Does spacemit-x60 support unaligned scalar memory, and was your test with or without that enabled?
It supports unaligned scalar but not unaligned vector. And it seems we don't add these features to
I think X86 only supports the equality (IsZeroCmp) case with vectors. General memcmp is complicated with vectors: you need to find the first element where the mismatch occurred and then compare only that element to find whether it is larger or smaller. I don't know if you can write that in target-independent IR without it being really nasty, and it wouldn't generate efficient code.
Thanks! I didn't notice that; I will try to implement this similarly in RISC-V.
Yes, I have noticed it. I was thinking that if we generate vector code earlier, we may get more room for optimizations. I will still try this approach, but I won't hold out high hopes.
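The first-mismatch problem described above can be modeled in scalar C++. This is an illustrative sketch of the two-step structure a vectorized ordered memcmp has to implement (not vector code): a cheap whole-block equality test, followed by locating and comparing the first differing byte.

```cpp
#include <cstddef>
#include <cstdint>

// Ordered memcmp over a block. A vector compare can cheaply tell
// whether ANY lane differs, but the sign of the result depends on the
// FIRST differing byte, which must then be located (find-first-set on
// the mismatch mask) and compared on its own.
int memcmp_first_mismatch(const uint8_t *a, const uint8_t *b, size_t n) {
  size_t i = 0;
  while (i < n && a[i] == b[i])  // vector analogue: block equality check
    ++i;
  if (i == n)
    return 0;
  // vector analogue: find first set lane in the mismatch mask,
  // then a single scalar byte compare decides the sign
  return a[i] < b[i] ? -1 : 1;
}
```

The equality-only (bcmp/IsZeroCmp) case skips the second step entirely, which is why it vectorizes so much more cleanly.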
Ping, any comments on the current scalar part? I'm working on vector expansion and will post it in a few days.
define i1 @bcmp_lt_zero(ptr %s1, ptr %s2) nounwind optsize {
; CHECK-LABEL: bcmp_lt_zero:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: li a0, 0
What happened here?
It expands to:
entry:
%0 = load i32, ptr %s1, align 1
%1 = load i32, ptr %s2, align 1
%2 = icmp ne i32 %0, %1
%3 = zext i1 %2 to i32
%ret = icmp slt i32 %3, 0
ret i1 %ret
And is optimized to:
entry:
%0 = load i32, ptr %s1, align 1
%1 = load i32, ptr %s2, align 1
%2 = icmp ne i32 %0, %1
%3 = zext i1 %2 to i32
ret i1 false
This is how bcmp is expanded.
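A minimal C++ analogue of why the fold is valid (illustrative names): the expanded bcmp result is a zero-extended i1, so its value is always 0 or 1, never negative, and a signed less-than-zero test on it is statically false.

```cpp
#include <cstdint>

// Model of the expansion above: bcmp's result is zext(i1 ne) to i32,
// so it can only be 0 or 1.
int32_t expanded_bcmp_result(uint32_t w1, uint32_t w2) {
  return static_cast<int32_t>(w1 != w2);  // zext i1 to i32
}

// `bcmp(...) < 0` on the expanded form: a signed comparison of a value
// that is always non-negative, so it folds to false.
bool bcmp_lt_zero(uint32_t w1, uint32_t w2) {
  return expanded_bcmp_result(w1, w2) < 0;
}
```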
I guess LLVM is assuming that bcmp should only be checked for eq/ne 0 and not gt/lt 0?
I don't know if this is the assumption of bcmp or just the behavior of ExpandMemcmp. I will keep these tests.
; CHECK-ALIGNED-RV64-NEXT: lbu a2, 0(a0)
; CHECK-ALIGNED-RV64-NEXT: lbu a3, 1(a0)
; CHECK-ALIGNED-RV64-NEXT: lbu a4, 2(a0)
; CHECK-ALIGNED-RV64-NEXT: lb a0, 3(a0)
Why didn't this use an lbu to avoid the andi later?
It seems to be related to https://reviews.llvm.org/D130397. Maybe we should revert this change or add a pattern to catch this.
At a macro level, it looks like ExpandMemCmp is making some problematic choices around unaligned loads and stores. As I commented before, ExpandMemCmp appears to be blindly emitting unaligned accesses (counted as one against the budget) without accounting for the fact that such loads are going to be scalarized again (i.e. resulting in N loads, where N is the type size). I think we need to fix this. In particular, the discussion around Zbb and Zbkb in this review seems to mostly come from cases where unaligned loads/stores are being expanded implicitly. I don't believe this change should move forward until the underlying issue in ExpandMemCmp has been addressed.
Can we break the enabling down into more manageable pieces? I think
I'd be fine with this type of approach.
Thanks! I have applied this suggestion. For
I meant we should start with only doing inline memcmp expansion under those conditions, not just the overlapping-loads part. The miscounting of loads that get split needs to be fixed before we can enable expansion without enableUnalignedScalarMem().
Oops, sorry, I misunderstood. Done.
LGTM, but wait for another review.
LGTM
I can't find any official documentation on this, but from other discussions[^1] and my own testing the spacemit-x60 seems to support unaligned scalar loads and stores. They seem to be performant, and just from a quick test we get a 2.45% speedup on 500.perlbench_r on the Banana Pi F3[^2]. This would allow it to take advantage of #107548. [^1]: #110454 (comment) [^2]: https://lnt.lukelau.me/db_default/v4/nts/32
There are two passes that depend on the implementation of TargetTransformInfo::enableMemCmpExpansion: MergeICmps and ExpandMemCmp.

This PR adds the initial implementation of enableMemCmpExpansion so that we can get some basic benefits from these two passes.

We don't enable expansion when there is no unaligned access support currently, because there are some issues with unaligned loads and stores in the ExpandMemcmp pass. We should fix these issues and enable the expansion later.

The vector case hasn't been tested, as we don't currently generate inlined vector instructions for memcmp.