[AArch64][SME] Implement the SME ABI (ZA state management) in Machine IR #149062
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Short Summary

This patch adds a new pass `aarch64-machine-sme-abi` to handle the ABI for ZA state (e.g., lazy saves and agnostic ZA functions). This is currently not enabled by default (but aims to be by LLVM 22). The goal is for this new pass to more optimally place ZA saves/restores and to work with exception handling.

Long Description

This patch reimplements management of ZA state for functions with private and shared ZA state. Agnostic ZA functions will be handled in a later patch. For now, this is under the flag `-aarch64-new-sme-abi`; however, we intend for this to replace the current SelectionDAG implementation once complete.

The approach taken here is to mark instructions as needing ZA to be in a specific state ("ACTIVE" or "LOCAL_SAVED"). Machine instructions implicitly defining or using ZA registers (such as $zt0 or $zab0) require the "ACTIVE" state. Function calls may need the "LOCAL_SAVED" or "ACTIVE" state depending on the callee (having shared or private ZA).

We already add ZA register uses/definitions to machine instructions, so no extra work is needed to mark these. Calls need to be marked by glueing AArch64ISD::INOUT_ZA_USE or AArch64ISD::REQUIRES_ZA_SAVE to the CALLSEQ_START.

These markers are then used by the MachineSMEABIPass to find instructions where there is a transition between required ZA states. These are the points where we need to insert code to set up or restore a ZA save (or initialize ZA).

To handle control flow between blocks (which may have different ZA state requirements), we bundle the incoming and outgoing edges of blocks. Bundles are formed by assigning each block an incoming and outgoing bundle (initially, all blocks have their own two bundles). Bundles are then combined by joining the outgoing bundle of a block with the incoming bundle of all successors.

These bundles are then assigned a ZA state based on the blocks that participate in the bundle. Blocks whose incoming edges are in a bundle "vote" for a ZA state that matches the state required at the first instruction in the block, and likewise, blocks whose outgoing edges are in a bundle vote for the ZA state that matches the last instruction in the block. The ZA state with the most votes is used, which aims to minimize the number of state transitions.

Patch is 216.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149062.diff

23 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64.h b/llvm/lib/Target/AArch64/AArch64.h
index 5496ebd495a55..8d0ff41fc8c08 100644
--- a/llvm/lib/Target/AArch64/AArch64.h
+++ b/llvm/lib/Target/AArch64/AArch64.h
@@ -60,6 +60,7 @@ FunctionPass *createAArch64CleanupLocalDynamicTLSPass();
FunctionPass *createAArch64CollectLOHPass();
FunctionPass *createSMEABIPass();
FunctionPass *createSMEPeepholeOptPass();
+FunctionPass *createMachineSMEABIPass();
ModulePass *createSVEIntrinsicOptsPass();
InstructionSelector *
createAArch64InstructionSelector(const AArch64TargetMachine &,
@@ -111,6 +112,7 @@ void initializeFalkorMarkStridedAccessesLegacyPass(PassRegistry&);
void initializeLDTLSCleanupPass(PassRegistry&);
void initializeSMEABIPass(PassRegistry &);
void initializeSMEPeepholeOptPass(PassRegistry &);
+void initializeMachineSMEABIPass(PassRegistry &);
void initializeSVEIntrinsicOptsPass(PassRegistry &);
void initializeAArch64Arm64ECCallLoweringPass(PassRegistry &);
} // end namespace llvm
diff --git a/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp b/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
index 7de66ccbf6f29..c0d118aa3afed 100644
--- a/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
@@ -92,8 +92,8 @@ class AArch64ExpandPseudo : public MachineFunctionPass {
bool expandCALL_BTI(MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI);
bool expandStoreSwiftAsyncContext(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MBBI);
- MachineBasicBlock *expandRestoreZA(MachineBasicBlock &MBB,
- MachineBasicBlock::iterator MBBI);
+ MachineBasicBlock *expandCommitOrRestoreZA(MachineBasicBlock &MBB,
+ MachineBasicBlock::iterator MBBI);
MachineBasicBlock *expandCondSMToggle(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MBBI);
};
@@ -974,9 +974,13 @@ bool AArch64ExpandPseudo::expandStoreSwiftAsyncContext(
}
MachineBasicBlock *
-AArch64ExpandPseudo::expandRestoreZA(MachineBasicBlock &MBB,
- MachineBasicBlock::iterator MBBI) {
+AArch64ExpandPseudo::expandCommitOrRestoreZA(MachineBasicBlock &MBB,
+ MachineBasicBlock::iterator MBBI) {
MachineInstr &MI = *MBBI;
+ bool IsRestoreZA = MI.getOpcode() == AArch64::RestoreZAPseudo;
+ assert((MI.getOpcode() == AArch64::RestoreZAPseudo ||
+ MI.getOpcode() == AArch64::CommitZAPseudo) &&
+ "Expected ZA commit or restore");
assert((std::next(MBBI) != MBB.end() ||
MI.getParent()->successors().begin() !=
MI.getParent()->successors().end()) &&
@@ -984,21 +988,23 @@ AArch64ExpandPseudo::expandRestoreZA(MachineBasicBlock &MBB,
// Compare TPIDR2_EL0 value against 0.
DebugLoc DL = MI.getDebugLoc();
- MachineInstrBuilder Cbz = BuildMI(MBB, MBBI, DL, TII->get(AArch64::CBZX))
- .add(MI.getOperand(0));
+ MachineInstrBuilder Branch =
+ BuildMI(MBB, MBBI, DL,
+ TII->get(IsRestoreZA ? AArch64::CBZX : AArch64::CBNZX))
+ .add(MI.getOperand(0));
// Split MBB and create two new blocks:
// - MBB now contains all instructions before RestoreZAPseudo.
- // - SMBB contains the RestoreZAPseudo instruction only.
- // - EndBB contains all instructions after RestoreZAPseudo.
+ // - SMBB contains the [Commit|RestoreZA]Pseudo instruction only.
+ // - EndBB contains all instructions after [Commit|RestoreZA]Pseudo.
MachineInstr &PrevMI = *std::prev(MBBI);
MachineBasicBlock *SMBB = MBB.splitAt(PrevMI, /*UpdateLiveIns*/ true);
MachineBasicBlock *EndBB = std::next(MI.getIterator()) == SMBB->end()
? *SMBB->successors().begin()
: SMBB->splitAt(MI, /*UpdateLiveIns*/ true);
- // Add the SMBB label to the TB[N]Z instruction & create a branch to EndBB.
- Cbz.addMBB(SMBB);
+ // Add the SMBB label to the CB[N]Z instruction & create a branch to EndBB.
+ Branch.addMBB(SMBB);
BuildMI(&MBB, DL, TII->get(AArch64::B))
.addMBB(EndBB);
MBB.addSuccessor(EndBB);
@@ -1006,8 +1012,12 @@ AArch64ExpandPseudo::expandRestoreZA(MachineBasicBlock &MBB,
// Replace the pseudo with a call (BL).
MachineInstrBuilder MIB =
BuildMI(*SMBB, SMBB->end(), DL, TII->get(AArch64::BL));
- MIB.addReg(MI.getOperand(1).getReg(), RegState::Implicit);
- for (unsigned I = 2; I < MI.getNumOperands(); ++I)
+ unsigned FirstBLOperand = 1;
+ if (IsRestoreZA) {
+ MIB.addReg(MI.getOperand(1).getReg(), RegState::Implicit);
+ FirstBLOperand = 2;
+ }
+ for (unsigned I = FirstBLOperand; I < MI.getNumOperands(); ++I)
MIB.add(MI.getOperand(I));
BuildMI(SMBB, DL, TII->get(AArch64::B)).addMBB(EndBB);
@@ -1617,8 +1627,9 @@ bool AArch64ExpandPseudo::expandMI(MachineBasicBlock &MBB,
return expandCALL_BTI(MBB, MBBI);
case AArch64::StoreSwiftAsyncContext:
return expandStoreSwiftAsyncContext(MBB, MBBI);
+ case AArch64::CommitZAPseudo:
case AArch64::RestoreZAPseudo: {
- auto *NewMBB = expandRestoreZA(MBB, MBBI);
+ auto *NewMBB = expandCommitOrRestoreZA(MBB, MBBI);
if (NewMBB != &MBB)
NextMBBI = MBB.end(); // The NextMBBI iterator is invalidated.
return true;
@@ -1629,6 +1640,8 @@ bool AArch64ExpandPseudo::expandMI(MachineBasicBlock &MBB,
NextMBBI = MBB.end(); // The NextMBBI iterator is invalidated.
return true;
}
+ case AArch64::InOutZAUsePseudo:
+ case AArch64::RequiresZASavePseudo:
case AArch64::COALESCER_BARRIER_FPR16:
case AArch64::COALESCER_BARRIER_FPR32:
case AArch64::COALESCER_BARRIER_FPR64:
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 4f13a14d24649..49135d05b689b 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -8154,53 +8154,54 @@ SDValue AArch64TargetLowering::LowerFormalArguments(
if (Subtarget->hasCustomCallingConv())
Subtarget->getRegisterInfo()->UpdateCustomCalleeSavedRegs(MF);
- // Create a 16 Byte TPIDR2 object. The dynamic buffer
- // will be expanded and stored in the static object later using a pseudonode.
- if (Attrs.hasZAState()) {
- TPIDR2Object &TPIDR2 = FuncInfo->getTPIDR2Obj();
- TPIDR2.FrameIndex = MFI.CreateStackObject(16, Align(16), false);
- SDValue SVL = DAG.getNode(AArch64ISD::RDSVL, DL, MVT::i64,
- DAG.getConstant(1, DL, MVT::i32));
-
- SDValue Buffer;
- if (!Subtarget->isTargetWindows() && !hasInlineStackProbe(MF)) {
- Buffer = DAG.getNode(AArch64ISD::ALLOCATE_ZA_BUFFER, DL,
- DAG.getVTList(MVT::i64, MVT::Other), {Chain, SVL});
- } else {
- SDValue Size = DAG.getNode(ISD::MUL, DL, MVT::i64, SVL, SVL);
- Buffer = DAG.getNode(ISD::DYNAMIC_STACKALLOC, DL,
- DAG.getVTList(MVT::i64, MVT::Other),
- {Chain, Size, DAG.getConstant(1, DL, MVT::i64)});
- MFI.CreateVariableSizedObject(Align(16), nullptr);
- }
- Chain = DAG.getNode(
- AArch64ISD::INIT_TPIDR2OBJ, DL, DAG.getVTList(MVT::Other),
- {/*Chain*/ Buffer.getValue(1), /*Buffer ptr*/ Buffer.getValue(0)});
- } else if (Attrs.hasAgnosticZAInterface()) {
- // Call __arm_sme_state_size().
- SDValue BufferSize =
- DAG.getNode(AArch64ISD::GET_SME_SAVE_SIZE, DL,
- DAG.getVTList(MVT::i64, MVT::Other), Chain);
- Chain = BufferSize.getValue(1);
-
- SDValue Buffer;
- if (!Subtarget->isTargetWindows() && !hasInlineStackProbe(MF)) {
- Buffer =
- DAG.getNode(AArch64ISD::ALLOC_SME_SAVE_BUFFER, DL,
- DAG.getVTList(MVT::i64, MVT::Other), {Chain, BufferSize});
- } else {
- // Allocate space dynamically.
- Buffer = DAG.getNode(
- ISD::DYNAMIC_STACKALLOC, DL, DAG.getVTList(MVT::i64, MVT::Other),
- {Chain, BufferSize, DAG.getConstant(1, DL, MVT::i64)});
- MFI.CreateVariableSizedObject(Align(16), nullptr);
+ if (!Subtarget->useNewSMEABILowering() || Attrs.hasAgnosticZAInterface()) {
+ // Old SME ABI lowering (deprecated):
+ // Create a 16 Byte TPIDR2 object. The dynamic buffer
+ // will be expanded and stored in the static object later using a
+ // pseudonode.
+ if (Attrs.hasZAState()) {
+ TPIDR2Object &TPIDR2 = FuncInfo->getTPIDR2Obj();
+ TPIDR2.FrameIndex = MFI.CreateStackObject(16, Align(16), false);
+ SDValue SVL = DAG.getNode(AArch64ISD::RDSVL, DL, MVT::i64,
+ DAG.getConstant(1, DL, MVT::i32));
+ SDValue Buffer;
+ if (!Subtarget->isTargetWindows() && !hasInlineStackProbe(MF)) {
+ Buffer = DAG.getNode(AArch64ISD::ALLOCATE_ZA_BUFFER, DL,
+ DAG.getVTList(MVT::i64, MVT::Other), {Chain, SVL});
+ } else {
+ SDValue Size = DAG.getNode(ISD::MUL, DL, MVT::i64, SVL, SVL);
+ Buffer = DAG.getNode(ISD::DYNAMIC_STACKALLOC, DL,
+ DAG.getVTList(MVT::i64, MVT::Other),
+ {Chain, Size, DAG.getConstant(1, DL, MVT::i64)});
+ MFI.CreateVariableSizedObject(Align(16), nullptr);
+ }
+ Chain = DAG.getNode(
+ AArch64ISD::INIT_TPIDR2OBJ, DL, DAG.getVTList(MVT::Other),
+ {/*Chain*/ Buffer.getValue(1), /*Buffer ptr*/ Buffer.getValue(0)});
+ } else if (Attrs.hasAgnosticZAInterface()) {
+ // Call __arm_sme_state_size().
+ SDValue BufferSize =
+ DAG.getNode(AArch64ISD::GET_SME_SAVE_SIZE, DL,
+ DAG.getVTList(MVT::i64, MVT::Other), Chain);
+ Chain = BufferSize.getValue(1);
+ SDValue Buffer;
+ if (!Subtarget->isTargetWindows() && !hasInlineStackProbe(MF)) {
+ Buffer = DAG.getNode(AArch64ISD::ALLOC_SME_SAVE_BUFFER, DL,
+ DAG.getVTList(MVT::i64, MVT::Other),
+ {Chain, BufferSize});
+ } else {
+ // Allocate space dynamically.
+ Buffer = DAG.getNode(
+ ISD::DYNAMIC_STACKALLOC, DL, DAG.getVTList(MVT::i64, MVT::Other),
+ {Chain, BufferSize, DAG.getConstant(1, DL, MVT::i64)});
+ MFI.CreateVariableSizedObject(Align(16), nullptr);
+ }
+ // Copy the value to a virtual register, and save that in FuncInfo.
+ Register BufferPtr =
+ MF.getRegInfo().createVirtualRegister(&AArch64::GPR64RegClass);
+ FuncInfo->setSMESaveBufferAddr(BufferPtr);
+ Chain = DAG.getCopyToReg(Chain, DL, BufferPtr, Buffer);
}
-
- // Copy the value to a virtual register, and save that in FuncInfo.
- Register BufferPtr =
- MF.getRegInfo().createVirtualRegister(&AArch64::GPR64RegClass);
- FuncInfo->setSMESaveBufferAddr(BufferPtr);
- Chain = DAG.getCopyToReg(Chain, DL, BufferPtr, Buffer);
}
if (CallConv == CallingConv::PreserveNone) {
@@ -8217,6 +8218,15 @@ SDValue AArch64TargetLowering::LowerFormalArguments(
}
}
+ if (Subtarget->useNewSMEABILowering()) {
+ // Clear new ZT0 state. TODO: Move this to the SME ABI pass.
+ if (Attrs.isNewZT0())
+ Chain = DAG.getNode(
+ ISD::INTRINSIC_VOID, DL, MVT::Other, Chain,
+ DAG.getConstant(Intrinsic::aarch64_sme_zero_zt, DL, MVT::i32),
+ DAG.getTargetConstant(0, DL, MVT::i32));
+ }
+
return Chain;
}
@@ -8781,14 +8791,12 @@ static SDValue emitSMEStateSaveRestore(const AArch64TargetLowering &TLI,
MachineFunction &MF = DAG.getMachineFunction();
AArch64FunctionInfo *FuncInfo = MF.getInfo<AArch64FunctionInfo>();
FuncInfo->setSMESaveBufferUsed();
-
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
Entry.Ty = PointerType::getUnqual(*DAG.getContext());
Entry.Node =
DAG.getCopyFromReg(Chain, DL, Info->getSMESaveBufferAddr(), MVT::i64);
Args.push_back(Entry);
-
SDValue Callee =
DAG.getExternalSymbol(IsSave ? "__arm_sme_save" : "__arm_sme_restore",
TLI.getPointerTy(DAG.getDataLayout()));
@@ -8906,6 +8914,9 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
*DAG.getContext());
RetCCInfo.AnalyzeCallResult(Ins, RetCC);
+ // Determine whether we need any streaming mode changes.
+ SMECallAttrs CallAttrs = getSMECallAttrs(MF.getFunction(), CLI);
+
// Check callee args/returns for SVE registers and set calling convention
// accordingly.
if (CallConv == CallingConv::C || CallConv == CallingConv::Fast) {
@@ -8919,14 +8930,26 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
CallConv = CallingConv::AArch64_SVE_VectorCall;
}
+ bool UseNewSMEABILowering = Subtarget->useNewSMEABILowering();
+ bool IsAgnosticZAFunction = CallAttrs.caller().hasAgnosticZAInterface();
+ auto ZAMarkerNode = [&]() -> std::optional<unsigned> {
+ // TODO: Handle agnostic ZA functions.
+ if (!UseNewSMEABILowering || IsAgnosticZAFunction)
+ return std::nullopt;
+ if (!CallAttrs.caller().hasZAState() && !CallAttrs.caller().hasZT0State())
+ return std::nullopt;
+ return CallAttrs.requiresLazySave() ? AArch64ISD::REQUIRES_ZA_SAVE
+ : AArch64ISD::INOUT_ZA_USE;
+ }();
+
if (IsTailCall) {
// Check if it's really possible to do a tail call.
IsTailCall = isEligibleForTailCallOptimization(CLI);
// A sibling call is one where we're under the usual C ABI and not planning
// to change that but can still do a tail call:
- if (!TailCallOpt && IsTailCall && CallConv != CallingConv::Tail &&
- CallConv != CallingConv::SwiftTail)
+ if (!ZAMarkerNode.has_value() && !TailCallOpt && IsTailCall &&
+ CallConv != CallingConv::Tail && CallConv != CallingConv::SwiftTail)
IsSibCall = true;
if (IsTailCall)
@@ -8978,9 +9001,6 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
assert(FPDiff % 16 == 0 && "unaligned stack on tail call");
}
- // Determine whether we need any streaming mode changes.
- SMECallAttrs CallAttrs = getSMECallAttrs(MF.getFunction(), CLI);
-
auto DescribeCallsite =
[&](OptimizationRemarkAnalysis &R) -> OptimizationRemarkAnalysis & {
R << "call from '" << ore::NV("Caller", MF.getName()) << "' to '";
@@ -8994,7 +9014,7 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
return R;
};
- bool RequiresLazySave = CallAttrs.requiresLazySave();
+ bool RequiresLazySave = !UseNewSMEABILowering && CallAttrs.requiresLazySave();
bool RequiresSaveAllZA = CallAttrs.requiresPreservingAllZAState();
if (RequiresLazySave) {
const TPIDR2Object &TPIDR2 = FuncInfo->getTPIDR2Obj();
@@ -9076,10 +9096,21 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
AArch64ISD::SMSTOP, DL, DAG.getVTList(MVT::Other, MVT::Glue), Chain,
DAG.getTargetConstant((int32_t)(AArch64SVCR::SVCRZA), DL, MVT::i32));
- // Adjust the stack pointer for the new arguments...
+ // Adjust the stack pointer for the new arguments... and mark ZA uses.
// These operations are automatically eliminated by the prolog/epilog pass
- if (!IsSibCall)
+ assert((!IsSibCall || !ZAMarkerNode.has_value()) &&
+ "ZA markers require CALLSEQ_START");
+ if (!IsSibCall) {
Chain = DAG.getCALLSEQ_START(Chain, IsTailCall ? 0 : NumBytes, 0, DL);
+ if (ZAMarkerNode) {
+ // Note: We need the CALLSEQ_START to glue the ZAMarkerNode to, simply
+ // using a chain can result in incorrect scheduling. The markers refer
+ // to the position just before the CALLSEQ_START (though occur after as
+ // CALLSEQ_START lacks in-glue).
+ Chain = DAG.getNode(*ZAMarkerNode, DL, DAG.getVTList(MVT::Other),
+ {Chain, Chain.getValue(1)});
+ }
+ }
SDValue StackPtr = DAG.getCopyFromReg(Chain, DL, AArch64::SP,
getPointerTy(DAG.getDataLayout()));
@@ -9551,7 +9582,7 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
}
}
- if (CallAttrs.requiresEnablingZAAfterCall())
+ if (RequiresLazySave || CallAttrs.requiresEnablingZAAfterCall())
// Unconditionally resume ZA.
Result = DAG.getNode(
AArch64ISD::SMSTART, DL, DAG.getVTList(MVT::Other, MVT::Glue), Result,
@@ -9572,7 +9603,6 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
SDValue TPIDR2_EL0 = DAG.getNode(
ISD::INTRINSIC_W_CHAIN, DL, MVT::i64, Result,
DAG.getConstant(Intrinsic::aarch64_sme_get_tpidr2, DL, MVT::i32));
-
// Copy the address of the TPIDR2 block into X0 before 'calling' the
// RESTORE_ZA pseudo.
SDValue Glue;
@@ -9584,7 +9614,6 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
DAG.getNode(AArch64ISD::RESTORE_ZA, DL, MVT::Other,
{Result, TPIDR2_EL0, DAG.getRegister(AArch64::X0, MVT::i64),
RestoreRoutine, RegMask, Result.getValue(1)});
-
// Finally reset the TPIDR2_EL0 register to 0.
Result = DAG.getNode(
ISD::INTRINSIC_VOID, DL, MVT::Other, Result,
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 6afb3c330d25b..72897a0446ca2 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -173,6 +173,10 @@ class AArch64TargetLowering : public TargetLowering {
MachineBasicBlock *EmitZTInstr(MachineInstr &MI, MachineBasicBlock *BB,
unsigned Opcode, bool Op0IsDef) const;
MachineBasicBlock *EmitZero(MachineInstr &MI, MachineBasicBlock *BB) const;
+
+ // Note: The following group of functions are only used as part of the old SME
+ // ABI lowering. They will be removed once -aarch64-new-sme-abi=true is the
+ // default.
MachineBasicBlock *EmitInitTPIDR2Object(MachineInstr &MI,
MachineBasicBlock *BB) const;
MachineBasicBlock *EmitAllocateZABuffer(MachineInstr &MI,
diff --git a/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h b/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h
index 800787cc0b4f5..3f6980fe11aea 100644
--- a/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64MachineFunctionInfo.h
@@ -213,9 +213,6 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
/// or return type
bool IsSVECC = false;
- /// The frame-index for the TPIDR2 object used for lazy saves.
- TPIDR2Object TPIDR2;
-
/// Whether this function changes streaming mode within the function.
bool HasStreamingModeChanges = false;
@@ -231,14 +228,6 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
// on function entry to record the initial pstate of a function.
Register PStateSMReg = MCRegister::NoRegister;
- // Holds a pointer to a buffer that is large enough to represent
- // all SME ZA state and any additional state required by the
- // __arm_sme_save/restore support routines.
- Register SMESaveBufferAddr = MCRegister::NoRegister;
-
- // true if SMESaveBufferAddr is used.
- bool SMESaveBufferUsed = false;
-
// Has the PNReg used to build PTRUE instruction.
// The PTRUE is used for the LD/ST of ZReg pairs in save and restore.
unsigned PredicateRegForFillSpill = 0;
@@ -250,6 +239,16 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
// Holds the SME function attributes (streaming mode, ZA/ZT0 state).
SMEAttrs SMEFnAttrs;
+ // Note: The following properties are only used for the old SME ABI lowering:
+ /// The frame-index for the TPIDR2 object used for lazy saves.
+ TPIDR2Object TPIDR2;
+ // Holds a pointer to a buffer that is large enough to represent
+ // all SME ZA state and any additional state required by the
+ // __arm_sme_save/restore support routines.
+ Register SMESaveBufferAddr = MCRegister::NoRegister;
+ // true if SMESaveBufferAddr is used.
+ bool SMESaveBufferUsed = false;
+
public:
AArch64FunctionInfo(const Function &F, const AArch64Subtarget *STI);
@@ -258,6 +257,13 @@ class AArch64FunctionInfo final : public MachineFunctionInfo {
const DenseMap<MachineBasicBlock *, MachineBasicBlock *> &Src2DstMBB)
const override;
+ // Old SME ABI lowering state getters/setters:
+...
[truncated]
// TODO: We should propagate desired incoming/outgoing states through blocks
// that have the "ANY" state first to make better global decisions.
This is a fairly important TODO. For larger functions where most blocks don't have defined ZA states (as they don't contain any instructions that require ZA to be in a specific state), this initial implementation can place saves/restores somewhat poorly (particularly in regards to nested loops).
I think something like #149510 would mostly resolve this.
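To make the edge-bundling and voting scheme from the PR summary concrete, here is a minimal self-contained C++ sketch. The simplified CFG representation, the names, and the two-state vote are all illustrative; the actual pass works on Machine IR and has more states:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative ZA states; the real pass defines more (e.g. OFF, CALLER_DORMANT).
enum ZAState { ANY, ACTIVE, LOCAL_SAVED, NUM_ZA_STATE };

struct Block {
  ZAState FirstInstState = ANY; // State required by the first instruction.
  ZAState LastInstState = ANY;  // State required by the last instruction.
  std::vector<int> Succs;       // Successor block indices.
};

// Union-find over edge bundles. Bundle 2*B is block B's incoming bundle,
// bundle 2*B+1 its outgoing bundle.
struct EdgeBundles {
  std::vector<int> Parent;
  explicit EdgeBundles(std::size_t N) : Parent(N) {
    for (std::size_t I = 0; I < N; ++I)
      Parent[I] = static_cast<int>(I);
  }
  int find(int X) { return Parent[X] == X ? X : Parent[X] = find(Parent[X]); }
  void join(int A, int B) { Parent[find(A)] = find(B); }
};

std::vector<ZAState> pickBundleStates(const std::vector<Block> &Blocks) {
  std::size_t N = Blocks.size();
  EdgeBundles Bundles(2 * N);
  // Join the outgoing bundle of each block with the incoming bundle of all
  // of its successors, so every CFG edge within a bundle gets one state.
  for (std::size_t I = 0; I < N; ++I)
    for (int S : Blocks[I].Succs)
      Bundles.join(static_cast<int>(2 * I + 1), 2 * S);

  // Each block "votes" for the state required at its entry (incoming bundle)
  // and at its exit (outgoing bundle).
  std::vector<std::array<unsigned, NUM_ZA_STATE>> Votes(2 * N, {});
  for (std::size_t I = 0; I < N; ++I) {
    ++Votes[Bundles.find(static_cast<int>(2 * I))][Blocks[I].FirstInstState];
    ++Votes[Bundles.find(static_cast<int>(2 * I + 1))][Blocks[I].LastInstState];
  }

  // The most-voted concrete state wins (ANY expresses no preference), which
  // aims to minimize the number of edges needing a state transition.
  std::vector<ZAState> States(2 * N);
  for (std::size_t I = 0; I < 2 * N; ++I) {
    const auto &V = Votes[Bundles.find(static_cast<int>(I))];
    States[I] = V[LOCAL_SAVED] > V[ACTIVE] ? LOCAL_SAVED : ACTIVE;
  }
  return States;
}
```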
This is mainly a high-level structural review, because I've yet to properly check the finer details of the emit... functions (although I'm guessing they're mostly taken from the existing implementation?).

What I've seen does look very nice. 😄
OFF,
// The number of ZA states (not a valid state)
NUM_ZA_STATE
I've yet to really look at the implementation, but my first high-level question is: why do we need more states than the ABI specifies? I expected to see:

- OFF
- ACTIVE
- DORMANT
Two reasons:

- It's not statically possible to know the ABI state at compile time
  - `LOCAL_SAVED` is `OFF` or `DORMANT` (after a lazy save has been set up)
  - N.B. the `CALLER_DORMANT` state is a little bit of a misnomer, as it's also `OFF` or `DORMANT`
- These states control code generation, which needs more specific states than what the ABI specifies
  - E.g., ZA can be `DORMANT` on entry to a function and after a lazy save has been set up, but the code generated for the transition to the `ACTIVE` state is different in both situations
  - This becomes more apparent in later PRs (not all upstream) that add support for more specialized cases
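For reference, here is a rough summary of the states mentioned in this thread as one enum (the comments are a hedged reading of the discussion above, not the exact upstream definitions):

```cpp
enum ZAState {
  ANY,            // No specific requirement; pick whatever minimizes transitions.
  ACTIVE,         // PSTATE.ZA is on and the ZA contents belong to this function.
  LOCAL_SAVED,    // ABI-wise OFF or DORMANT: this function has set up a lazy
                  // save, so its ZA contents can be restored on demand.
  CALLER_DORMANT, // ABI-wise OFF or DORMANT on entry: a caller's lazy save may
                  // be live and must be committed before ZA is used here.
  OFF,            // PSTATE.ZA is off.
  NUM_ZA_STATE    // The number of ZA states (not a valid state).
};
```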
Thanks for the pointers :) I guess it will become clearer to me as I browse through the code. My current expectations are:

- In a `za_agnostic` or `za_preserved` function, we can enter in any state. We can exit in any state.
- In a `za_inout` function, we are in the `active` state on entry and exit.
- In a `za_private` function, we can enter in the `dormant || off` state. We can exit in any state.

In all those cases, there is nothing to do if we call a `za_agnostic`, `za_inout`, or `za_preserved` function. But if calling a `za_private` function, we might need to set up a lazy-save buffer (when the current state is `active`).

AFAIU, `LOCAL_SAVED` and `CALLER_DORMANT` can both mean `dormant || off`. The former is on the caller side and might need to set up a lazy-save buffer, while the latter is on the callee side and might need to commit the lazy save. I would find pseudo-states like `OFF_OR_DORMANT_SETUP` and `OFF_OR_DORMANT_COMMIT` clearer, as they explicitly say what state we are in, and what codegen is expected in case of a transition.
> In a za_agnostic or za_preserved function, we can enter in any state. We can exit in any state.

These two cases are pretty much the same as `za_inout`. Though `za_agnostic` can be entered with `ZA = OFF`, we have to assume it's active and emit code to ensure ZA is preserved. Agnostic ZA is not implemented in this PR (there's a later patch that adds basic support, then a [currently] downstream change for EH support). So as far as code-gen is concerned we enter/exit in the `active` state.

> In a za_inout function, we are in the active state on entry and exit.

Yep 👍

> In a za_private function, we can enter in dormant || off state. We can exit in any state.

We always exit with ZA off.
%0:gpr32common = COPY $w0
ADJCALLSTACKDOWN 0, 0, implicit-def dead $sp, implicit $sp
RequiresZASavePseudo
BL @clobber, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp
Again a high-level question to help me understand the implementation: given that `csr_aarch64_aapcs` is just a `RegMask` operand, why can't we define a new mask which also clobbers `$za`? If we could spill/reload that register through `storeRegToStackSlot`/`loadRegFromStackSlot` and have the register as an actual part of the ABI, then we would get a lot for free?

I'm guessing we can't do any of that because the register is reserved? Do you know what it would cost to "unreserve" `$za`?
I've thought about this more in the past, but these are a few reasons I can remember off the top of my head:

- The SME ABI for lazily saving/restoring ZA does not behave like a simple load/store, so you have to be more careful about how actions are placed than a regular spill/fill (you can't "reload" from a single spill twice, for example)
- The action needed can depend on more information than a "should spill" or "should fill"
  - E.g., depending on how ZA is clobbered, you may need to commit a save or just set up a lazy save
  - Note: more specific cases are not implemented yet (as these first PRs are for the basic functionality)
- They introduce function calls, which may not be safe/nice to insert in `storeRegToStackSlot`
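As a concrete illustration of why this isn't a plain spill/fill, here is a hedged pseudo-C++ sketch of the lazy-save protocol around a private-ZA call. The TPIDR2_EL0 accessors are stand-ins for a system register (modeled with a global here purely for illustration); `__arm_tpidr2_restore` is the SME support routine the restore branch calls. The real emitted sequence also re-enables ZA with `smstart za` before the check:

```cpp
#include <cstdint>

// Simplified TPIDR2 block; the ABI-defined layout has more fields.
struct TPIDR2Block {
  void *ZASaveBuffer;       // SVL x SVL byte buffer for ZA contents.
  uint16_t NumZASaveSlices; // Number of live horizontal ZA slices.
};

// Stand-ins for MSR/MRS of TPIDR2_EL0 (a system register).
static TPIDR2Block *Tpidr2El0 = nullptr;
static void setTpidr2El0(TPIDR2Block *B) { Tpidr2El0 = B; }
static TPIDR2Block *getTpidr2El0() { return Tpidr2El0; }

// SME support routine provided by the runtime.
extern "C" void __arm_tpidr2_restore(TPIDR2Block *Block);

void callPrivateZACallee(void (*Callee)(), TPIDR2Block *Block) {
  setTpidr2El0(Block); // 1. Arm the lazy save: ZA becomes dormant.
  Callee();            // 2. A private-ZA callee commits the save only if it
                       //    actually uses ZA; otherwise ZA is left untouched.
  if (getTpidr2El0() == nullptr)
    __arm_tpidr2_restore(Block); // 3. Save was committed: reload ZA.
  setTpidr2El0(nullptr);         // 4. Unconditionally disarm the lazy save.
}
```

Note how the "reload" is conditional and one-shot, which is exactly what makes it a poor fit for the generic spill/fill hooks.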
The bundles stuff is new to me so I'm putting my faith in them working as described. Other than that I'm mostly happy with the current state of this prototype and just have a few final comments, questions and observations.
; }
;
; In this example we don't setup a lazy save before shared_za_call(), however,
; we still enter the catch block in a ZA off state. This leads to use emitting

Suggested change:

; we still enter the catch block in a ZA off state. This leads to us emitting
Are there plans to fix this? I don't like the idea of reading random data from the stack.
I've fixed this in a later patch (not upstream yet) for agnostic ZA functions (where ZA is required to be preserved on all normal exits). I've not done this for `shared_za` functions as it's not strictly required (this behavior matches GCC), but I could fix it in the same way.

The fix would result in restoring ZA to the state before the `shared_za` call that throws the exception, though depending on that would still be UB (and not compatible with GCC).
Sounds good. Yes, let's remember to do that for `shared_za` as well, because I'm not keen to replicate GCC's behaviour here.
I'm done with my initial round. What I've seen does make a lot of sense, but I'll need to do another, more thorough review once I get some answers :) Overall, I think what would help is a few more MIR tests for the new pass and the AArch64PseudoLowering pass, as .ll tests can be hard to read due to the amount of unnecessary asm.
I left a couple more comments, and I do understand a lot more about what we are trying to achieve here. I think I'll be able to do an actual thorough review tomorrow 😄
emitStateChange(MBB, Inst.InsertPt, CurrentState, Inst.NeededState,
                Inst.PhysLiveRegs);
CurrentState = Inst.NeededState;
}
I like the approach; I think it's great to first collect all the required states, and then emit the right code if a transition is detected. :)
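For readers skimming the thread, the two-phase shape being praised here looks roughly like the following self-contained sketch (simplified stand-in types, not the pass's real signatures):

```cpp
#include <cstdio>
#include <vector>

// Illustrative stand-ins for the pass's types.
enum ZAState { ANY, ACTIVE, LOCAL_SAVED };
struct InstInfo {
  int InsertPt;        // Where to insert transition code (simplified).
  ZAState NeededState; // State this instruction requires ZA to be in.
};

// Stub: the real pass emits MIR (lazy-save setup, restores, zeroing, ...).
static void emitStateChange(int InsertPt, ZAState From, ZAState To) {
  std::printf("transition %d -> %d before inst %d\n", From, To, InsertPt);
}

// Phase 2: walk the requirements collected in phase 1 and emit code only
// where the needed state differs from the tracked current state.
static void emitBlock(ZAState EntryState, const std::vector<InstInfo> &Insts) {
  ZAState CurrentState = EntryState;
  for (const InstInfo &Inst : Insts) {
    if (Inst.NeededState == CurrentState)
      continue; // Already in the right state; nothing to emit.
    emitStateChange(Inst.InsertPt, CurrentState, Inst.NeededState);
    CurrentState = Inst.NeededState;
  }
}
```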
Perhaps I'm suffering from re-re-re-review blindness, but nothing is jumping out as obviously wrong. Given it's a default-off work-in-progress feature, I'm happy for this to be the starting point to evolve from.
Same here, so please go ahead if that makes your stack of changes easier to manage :) With my limited SME and ZA management knowledge, I cannot comment on whether this new pass is the right approach or how much better it is than the previous one, so I'll trust @paulwalker-arm and @sdesmalen-arm on that. What I can say is that the changes do make sense to me, and this is a great starting point for enabling better ZA "stack management".

Edit: Things I would be happy to see considered in follow-ups:
Thanks for the reviews 🚀 I'll also add #149062 (comment) (resuming from exceptions thrown in shared ZA calls) to the list of follow-ups, alongside my own #149062 (comment) (which might overlap with #149062 (comment) -- which I think could achieve a similar result).
good
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/23684
I'll push a fix for the buildbots shortly: https://github.com/llvm/llvm-project/pull/154295/files
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/8/builds/19779

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/174/builds/23034

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/195/builds/13431

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/88/builds/15240

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/168/builds/15359

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/95/builds/17226

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/45/builds/15555

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/76/builds/12238

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/202/builds/2922

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/42/builds/5852
@@ -0,0 +1,288 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sme -aarch64-new-sme-abi < %s | FileCheck %s |
This testcase fails if llc is built with EXPENSIVE_CHECKS, or if you just add "-verify-machineinstrs" to the RUN line:
*** Bad machine code: Using an undefined physical register ***
- function: za_with_raii
- basic block: %bb.1 throw_exception (0x563ee44cad28)
- instruction: %12:gpr64 = COPY $x0
- operand 1: $x0
LLVM ERROR: Found 1 machine code errors.
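For reference, the failure should reproduce by adding the verifier flag to the test's own RUN invocation, e.g. (the test filename here is a placeholder):

```
llc -mtriple=aarch64-linux-gnu -mattr=+sme -aarch64-new-sme-abi -verify-machineinstrs < za-with-raii.ll
```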
Thanks 👍 I'm aware of this issue and have a fix that I'll be posting shortly.
See #154325
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/136/builds/5015

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/127/builds/4466

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/130/builds/15021
…ing liveouts (#154325): These liveins are not defined by predecessors, so should not be considered as liveouts in predecessor blocks. This resolves:
- #149062 (comment)
- #153417 (comment)
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/49/builds/2076