[AMDGPU] Improve isBasicBlockPrologue helper function #69924

Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Christudasan Devadasan (cdevadas)

Changes: This function helps to skip over the basic block prolog instructions while inserting a new spill or a copy during live-range split. However, it is currently incomplete: it skips only the instructions that restore the exec mask, but it should also skip the BB prolog spills and the live-range split copies inserted earlier during regalloc while dealing with the BB live-ins. This patch partially addresses the limitation by including only the spills in the query; the live-range split COPY handling is pending. The helper function is also used at multiple places to skip over the BB prolog, and those uses should continue to produce functionally correct code after this patch.

This patch depends on #69923

Patch is 413.71 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69924.diff

47 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 5de993ec3cba1d8..1d855d9bd83157d 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -274,7 +274,7 @@ class PrologEpilogSGPRSpillBuilder {
Register SubReg = NumSubRegs == 1
? SuperReg
: Register(TRI.getSubReg(SuperReg, SplitParts[I]));
- BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_WRITELANE_B32), Spill[I].VGPR)
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::SI_WRITELANE_PSEUDO), Spill[I].VGPR)
.addReg(SubReg)
.addImm(Spill[I].Lane)
.addReg(Spill[I].VGPR, RegState::Undef);
@@ -319,7 +319,7 @@ class PrologEpilogSGPRSpillBuilder {
Register SubReg = NumSubRegs == 1
? SuperReg
: Register(TRI.getSubReg(SuperReg, SplitParts[I]));
- BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::SI_READLANE_PSEUDO), SubReg)
.addReg(Spill[I].VGPR)
.addImm(Spill[I].Lane);
}
@@ -1554,12 +1554,10 @@ void SIFrameLowering::determineCalleeSaves(MachineFunction &MF,
// TODO: Handle this elsewhere at an early point. Walking through all MBBs
// here would be a bad heuristic. A better way should be by calling
// allocateWWMSpill during the regalloc pipeline whenever a physical
- // register is allocated for the intended virtual registers. That will
- // also help excluding the general use of WRITELANE/READLANE intrinsics
- // that won't really need any such special handling.
- if (MI.getOpcode() == AMDGPU::V_WRITELANE_B32)
+ // register is allocated for the intended virtual registers.
+ if (MI.getOpcode() == AMDGPU::SI_WRITELANE_PSEUDO)
MFI->allocateWWMSpill(MF, MI.getOperand(0).getReg());
- else if (MI.getOpcode() == AMDGPU::V_READLANE_B32)
+ else if (MI.getOpcode() == AMDGPU::SI_READLANE_PSEUDO)
MFI->allocateWWMSpill(MF, MI.getOperand(1).getReg());
else if (TII->isWWMRegSpillOpcode(MI.getOpcode()))
NeedExecCopyReservedReg = true;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 4ff7b462f0f3295..f977c2412bbe322 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2090,6 +2090,14 @@ bool SIInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
MI.setDesc(get(AMDGPU::S_AND_SAVEEXEC_B32));
break;
+ case AMDGPU::SI_WRITELANE_PSEUDO:
+ MI.setDesc(get(AMDGPU::V_WRITELANE_B32));
+ break;
+
+ case AMDGPU::SI_READLANE_PSEUDO:
+ MI.setDesc(get(AMDGPU::V_READLANE_B32));
+ break;
+
case AMDGPU::V_MOV_B64_PSEUDO: {
Register Dst = MI.getOperand(0).getReg();
Register DstLo = RI.getSubReg(Dst, AMDGPU::sub0);
@@ -3907,7 +3915,9 @@ bool SIInstrInfo::hasUnwantedEffectsWhenEXECEmpty(const MachineInstr &MI) const
// However, executing them with EXEC = 0 causes them to operate on undefined
// data, which we avoid by returning true here.
if (Opcode == AMDGPU::V_READFIRSTLANE_B32 ||
- Opcode == AMDGPU::V_READLANE_B32 || Opcode == AMDGPU::V_WRITELANE_B32)
+ Opcode == AMDGPU::V_READLANE_B32 || Opcode == AMDGPU::V_WRITELANE_B32 ||
+ Opcode == AMDGPU::SI_READLANE_PSEUDO ||
+ Opcode == AMDGPU::SI_WRITELANE_PSEUDO)
return true;
return false;
@@ -4301,7 +4311,9 @@ static bool shouldReadExec(const MachineInstr &MI) {
if (SIInstrInfo::isVALU(MI)) {
switch (MI.getOpcode()) {
case AMDGPU::V_READLANE_B32:
+ case AMDGPU::SI_READLANE_PSEUDO:
case AMDGPU::V_WRITELANE_B32:
+ case AMDGPU::SI_WRITELANE_PSEUDO:
return false;
}
@@ -8326,8 +8338,11 @@ unsigned SIInstrInfo::getLiveRangeSplitOpcode(Register SrcReg,
}
bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI) const {
- return !MI.isTerminator() && MI.getOpcode() != AMDGPU::COPY &&
- MI.modifiesRegister(AMDGPU::EXEC, &RI);
+ uint16_t Opc = MI.getOpcode();
+ // FIXME: Copies inserted in the block prolog for live-range split should also
+ // be included.
+ return (isSpillOpcode(Opc) || (!MI.isTerminator() && Opc != AMDGPU::COPY &&
+ MI.modifiesRegister(AMDGPU::EXEC, &RI)));
}
MachineInstrBuilder
@@ -8970,7 +8985,9 @@ SIInstrInfo::getInstructionUniformity(const MachineInstr &MI) const {
return InstructionUniformity::NeverUniform;
unsigned opcode = MI.getOpcode();
- if (opcode == AMDGPU::V_READLANE_B32 || opcode == AMDGPU::V_READFIRSTLANE_B32)
+ if (opcode == AMDGPU::V_READLANE_B32 ||
+ opcode == AMDGPU::V_READFIRSTLANE_B32 ||
+ opcode == AMDGPU::SI_READLANE_PSEUDO)
return InstructionUniformity::AlwaysUniform;
if (isCopyInstr(MI)) {
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 5ef17c44f7de389..34453c41109580d 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -671,6 +671,11 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
return get(Opcode).TSFlags & SIInstrFlags::SGPRSpill;
}
+ bool isSpillOpcode(uint16_t Opcode) const {
+ return get(Opcode).TSFlags &
+ (SIInstrFlags::SGPRSpill | SIInstrFlags::VGPRSpill);
+ }
+
static bool isWWMRegSpillOpcode(uint16_t Opcode) {
return Opcode == AMDGPU::SI_SPILL_WWM_V32_SAVE ||
Opcode == AMDGPU::SI_SPILL_WWM_AV32_SAVE ||
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 567f1b812c1808c..6929dc4f4ab91cc 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -875,6 +875,25 @@ defm SI_SPILL_S384 : SI_SPILL_SGPR <SReg_384>;
defm SI_SPILL_S512 : SI_SPILL_SGPR <SReg_512>;
defm SI_SPILL_S1024 : SI_SPILL_SGPR <SReg_1024>;
+let SGPRSpill = 1 in {
+def SI_WRITELANE_PSEUDO : PseudoInstSI <(outs VGPR_32:$vdst),
+ (ins SReg_32:$src0, i32imm:$src1, VGPR_32:$vdst_in)> {
+ let hasSideEffects = 0;
+ let mayLoad = 0;
+ let mayStore = 0;
+ let VALU = 1;
+ let Constraints = "$vdst = $vdst_in";
+}
+
+def SI_READLANE_PSEUDO : PseudoInstSI <(outs SReg_32:$sdst),
+ (ins VGPR_32:$src0, i32imm:$src1)> {
+ let hasSideEffects = 0;
+ let mayLoad = 0;
+ let mayStore = 0;
+ let VALU = 1;
+}
+} // End SGPRSpill = 1
+
// VGPR or AGPR spill instructions. In case of AGPR spilling a temp register
// needs to be used and an extra instruction to move between VGPR and AGPR.
// UsesTmp adds to the total size of an expanded spill in this case.
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index d0a81673d6528c2..ed81fac6886adca 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -1769,7 +1769,7 @@ bool SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI, int Index,
// Mark the "old value of vgpr" input undef only if this is the first sgpr
// spill to this specific vgpr in the first basic block.
auto MIB = BuildMI(*SB.MBB, MI, SB.DL,
- SB.TII.get(AMDGPU::V_WRITELANE_B32), Spill.VGPR)
+ SB.TII.get(AMDGPU::SI_WRITELANE_PSEUDO), Spill.VGPR)
.addReg(SubReg, getKillRegState(UseKill))
.addImm(Spill.Lane)
.addReg(Spill.VGPR);
@@ -1815,7 +1815,7 @@ bool SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI, int Index,
: Register(getSubReg(SB.SuperReg, SB.SplitParts[i]));
MachineInstrBuilder WriteLane =
- BuildMI(*SB.MBB, MI, SB.DL, SB.TII.get(AMDGPU::V_WRITELANE_B32),
+ BuildMI(*SB.MBB, MI, SB.DL, SB.TII.get(AMDGPU::SI_WRITELANE_PSEUDO),
SB.TmpVGPR)
.addReg(SubReg, SubKillState)
.addImm(i % PVD.PerVGPR)
@@ -1877,8 +1877,8 @@ bool SIRegisterInfo::restoreSGPR(MachineBasicBlock::iterator MI, int Index,
: Register(getSubReg(SB.SuperReg, SB.SplitParts[i]));
SpilledReg Spill = VGPRSpills[i];
- auto MIB = BuildMI(*SB.MBB, MI, SB.DL, SB.TII.get(AMDGPU::V_READLANE_B32),
- SubReg)
+ auto MIB = BuildMI(*SB.MBB, MI, SB.DL,
+ SB.TII.get(AMDGPU::SI_READLANE_PSEUDO), SubReg)
.addReg(Spill.VGPR)
.addImm(Spill.Lane);
if (SB.NumSubRegs > 1 && i == 0)
@@ -1911,7 +1911,7 @@ bool SIRegisterInfo::restoreSGPR(MachineBasicBlock::iterator MI, int Index,
bool LastSubReg = (i + 1 == e);
auto MIB = BuildMI(*SB.MBB, MI, SB.DL,
- SB.TII.get(AMDGPU::V_READLANE_B32), SubReg)
+ SB.TII.get(AMDGPU::SI_READLANE_PSEUDO), SubReg)
.addReg(SB.TmpVGPR, getKillRegState(LastSubReg))
.addImm(i);
if (SB.NumSubRegs > 1 && i == 0)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
index b19230c2e876c4f..10cbc56cc5fbea1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
@@ -144,8 +144,6 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s32 ; 4-byte Folded Spill
; CHECK-NEXT: s_mov_b32 exec_lo, s21
; CHECK-NEXT: ; %bb.2: ; in Loop: Header=BB0_1 Depth=1
-; CHECK-NEXT: buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
-; CHECK-NEXT: buffer_load_dword v1, off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
; CHECK-NEXT: s_or_saveexec_b32 s21, -1
; CHECK-NEXT: buffer_load_dword v2, off, s[0:3], s32 ; 4-byte Folded Reload
; CHECK-NEXT: s_mov_b32 exec_lo, s21
@@ -163,6 +161,9 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: v_readlane_b32 s17, v2, 1
; CHECK-NEXT: v_readlane_b32 s18, v2, 2
; CHECK-NEXT: v_readlane_b32 s19, v2, 3
+; CHECK-NEXT: buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
+; CHECK-NEXT: buffer_load_dword v1, off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
+; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: image_sample v0, v[0:1], s[8:15], s[16:19] dmask:0x1 dim:SQ_RSRC_IMG_2D
; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
diff --git a/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir b/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir
index 7209d160e6c8a7a..3e56e49bf31d5cb 100644
--- a/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir
+++ b/llvm/test/CodeGen/AMDGPU/av_spill_cross_bb_usage.mir
@@ -60,8 +60,8 @@ body: |
; GCN-NEXT: renamable $vgpr46 = COPY $vgpr1, implicit $exec
; GCN-NEXT: renamable $vgpr45 = COPY $vgpr0, implicit $exec
; GCN-NEXT: renamable $sgpr16_sgpr17 = IMPLICIT_DEF
- ; GCN-NEXT: $vgpr40 = V_WRITELANE_B32 $sgpr30, 0, $vgpr40, implicit-def $sgpr30_sgpr31, implicit $sgpr30_sgpr31
- ; GCN-NEXT: $vgpr40 = V_WRITELANE_B32 $sgpr31, 1, $vgpr40, implicit $sgpr30_sgpr31
+ ; GCN-NEXT: $vgpr40 = SI_WRITELANE_PSEUDO $sgpr30, 0, $vgpr40, implicit-def $sgpr30_sgpr31, implicit $sgpr30_sgpr31
+ ; GCN-NEXT: $vgpr40 = SI_WRITELANE_PSEUDO $sgpr31, 1, $vgpr40, implicit $sgpr30_sgpr31
; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr14, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 52, 0, 0, implicit $exec, implicit-def $vgpr14_vgpr15, implicit $vgpr14_vgpr15 :: (store (s32) into %stack.1, addrspace 5)
; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr15, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 56, 0, 0, implicit $exec, implicit killed $vgpr14_vgpr15 :: (store (s32) into %stack.1 + 4, addrspace 5)
; GCN-NEXT: BUFFER_STORE_DWORD_OFFSET killed $vgpr10, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 60, 0, 0, implicit $exec, implicit-def $vgpr10_vgpr11, implicit $vgpr10_vgpr11 :: (store (s32) into %stack.2, addrspace 5)
@@ -124,8 +124,8 @@ body: |
ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32
renamable $sgpr16_sgpr17 = IMPLICIT_DEF
- $vgpr40 = V_WRITELANE_B32 $sgpr30, 0, $vgpr40, implicit-def $sgpr30_sgpr31, implicit $sgpr30_sgpr31
- $vgpr40 = V_WRITELANE_B32 killed $sgpr31, 1, $vgpr40, implicit killed $sgpr30_sgpr31
+ $vgpr40 = SI_WRITELANE_PSEUDO $sgpr30, 0, $vgpr40, implicit-def $sgpr30_sgpr31, implicit $sgpr30_sgpr31
+ $vgpr40 = SI_WRITELANE_PSEUDO killed $sgpr31, 1, $vgpr40, implicit killed $sgpr30_sgpr31
dead $sgpr30_sgpr31 = SI_CALL killed renamable $sgpr16_sgpr17, 0, csr_amdgpu, implicit-def dead $vgpr0
%8:vreg_64 = nofpexcept V_FMA_F64_e64 0, %7, 0, %6, 0, %5, 0, 0, implicit $mode, implicit $exec
ADJCALLSTACKDOWN 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32
diff --git a/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll b/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
index 73d5088141cdb31..dc519df52919add 100644
--- a/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
+++ b/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
@@ -75,10 +75,10 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b64 exec, s[8:9]
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
-; GCN-O0-NEXT: s_waitcnt vmcnt(1)
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s4, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s5, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s2, 0xf000
; GCN-O0-NEXT: s_mov_b32 s0, 0
; GCN-O0-NEXT: ; kill: def $sgpr0 killed $sgpr0 def $sgpr0_sgpr1
@@ -104,7 +104,6 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB0_3
; GCN-O0-NEXT: ; %bb.2: ; %bb.inner.then
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
@@ -112,7 +111,9 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 1
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -248,10 +249,10 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b64 exec, s[8:9]
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
-; GCN-O0-NEXT: s_waitcnt vmcnt(1)
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s4, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s5, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s2, 0xf000
; GCN-O0-NEXT: s_mov_b32 s0, 0
; GCN-O0-NEXT: ; kill: def $sgpr0 killed $sgpr0 def $sgpr0_sgpr1
@@ -277,7 +278,6 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB1_4
; GCN-O0-NEXT: ; %bb.2: ; %bb.inner.then
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
@@ -285,7 +285,9 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 1
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -311,7 +313,6 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_or_b64 exec, exec, s[0:1]
; GCN-O0-NEXT: s_branch .LBB1_5
; GCN-O0-NEXT: .LBB1_4: ; %bb.inner.end
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
@@ -322,7 +323,9 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_or_b64 exec, exec, s[2:3]
; GCN-O0-NEXT: v_readlane_b32 s0, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 2
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -508,7 +511,6 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_xor_b64 exec, exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB2_5
; GCN-O0-NEXT: ; %bb.3: ; %bb.then
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
@@ -516,7 +518,9 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 1
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -532,7 +536,6 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: buffer_store_dword v0, v[1:2], s[0:3], 0 addr64
; GCN-O0-NEXT: s_branch .LBB2_5
; GCN-O0-NEXT: .LBB2_4: ; %bb.else
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
@@ -540,7 +543,9 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v0, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v0, 1
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 2
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -954,20 +959,21 @@ define amdgpu_kernel void @s_endpgm_unsafe_barrier(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB4_2
; GCN-O0-NEXT: ; %bb.1: ; %bb.then
-; GCN-O0-NEXT: s_waitcnt...
[truncated]
Would you mind giving more details on the issue you are trying to solve? Did you hit the problem in any real case? Does this change have any relation to https://reviews.llvm.org/D145329? Thanks!
Yes, this patch would avoid the need for https://reviews.llvm.org/D145329.
I think isBasicBlockPrologue is unworkable and should be removed. I don't think it's sensible to do ad-hoc exec mask tracking like this and we need a better control flow representation. I think for the moment we are better off splitting blocks, and then starting to augment MachineBasicBlock with an annotation of the divergent CFG.
The spills inserted at the block entry for BBLiveIns are part of the BB prolog. That's what I believe. So, teaching isBasicBlockPrologue to skip over them is what this patch does.
I agree that the block prologue is hard to get correct within LLVM. But as we strongly depend on it now, we cannot remove it before we come up with an alternative solution. I think one possible way to remove the block prologue is: don't reconverge the threads at the block header; instead, always set up the exec mask correctly in the terminators of the predecessor block. But there are still many details to work out. I don't have a strong preference for which way we should go (either always splitting blocks or the change here). But it is important that it works in all the cases we have seen (e.g. live-range split might still insert a vector instruction before end_cf even though it is a terminator). We definitely want a test for such a change; a '.ll' test is always helpful for revisiting the issue later.
"I don't have strong preference which way we should go (either always split block or the change here). But it is important it should work in all cases we have seen (e.g. livenrange-split might still insert vector instruction before end_cf even it is a terminator.)" "We definitely want a test for such change, a '.ll' test is always helpful for revisiting the issue again." |
Force-pushed from 54d69ce to a218ab9.
Rebased to reflect the pre-patches.
I think the title needs to be reworked. "Improve" doesn't really explain what this does.
return !MI.isTerminator() && MI.getOpcode() != AMDGPU::COPY &&
       MI.modifiesRegister(AMDGPU::EXEC, &RI);
// It should cover all possible instructions inserted during regalloc in the
// block prolog.
This should elaborate on why. It's still a pretty dicey assumption.
OK, it sounds fine to me to solve the live-range issue separately.
But I think you should have a test that would fail without this change (not the live-range-split related issue) and pass with this change?
Force-pushed from 4285366 to d988509.
Rebased and addressed review comments.
return !MI.isTerminator() && MI.getOpcode() != AMDGPU::COPY &&
       MI.modifiesRegister(AMDGPU::EXEC, &RI);
// It should cover all possible instructions (spills, copies, etc.) early
// inserted during RA in the block prolog to avoid the broken prologs that
I think the "broken prolog" reads a bit too vaguely. Also not sure what "early" is doing here. How about
"We need to handle instructions which may be inserted during register allocation to handle the prolog. The initial prolog instruction may have been separated from the start of the block by spills and copies inserted needed by the prolog"
I still don't feel particularly comfortable trusting that any spills here were for the prolog.
Done
Force-pushed from d988509 to 36013c0.
✅ With the latest revision this PR passed the C/C++ code formatter.
We really need something properly embedded in the IR for this, but I guess we can go with this for now.
The insertion point determined by RA while attempting spills and live-range splits at the beginning of a block goes wrong at times, and the newly inserted vector instructions end up placed before the exec-mask restore instruction, which is incorrect. This occurs mainly because of the dependency on isBasicBlockPrologue, which doesn't account for instructions (spills and split copies) inserted early during RA, and so the block prolog breaks. A better approach for deciding the insertion point should be worked out. For now, improve the helper function to consider all possible early insertions. This patch adds the spill instructions; the copies associated with live-range splitting should also be included in the block prolog.
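For reference, the predicate as changed by this patch (copied from the SIInstrInfo.cpp hunk in the diff above):

bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI) const {
  uint16_t Opc = MI.getOpcode();
  // FIXME: Copies inserted in the block prolog for live-range split should
  // also be included.
  return (isSpillOpcode(Opc) || (!MI.isTerminator() && Opc != AMDGPU::COPY &&
                                 MI.modifiesRegister(AMDGPU::EXEC, &RI)));
}

Spill pseudos now count as prologue instructions in addition to the exec-mask manipulations; the live-range-split copies remain a known gap, per the FIXME.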
Force-pushed from 36013c0 to ffba940.
Starting from this commit we are hitting the following assertion when running Vulkan CTS:

lib/CodeGen/SplitKit.cpp:1648: void llvm::SplitEditor::splitLiveThroughBlock(unsigned int, unsigned int, llvm::SlotIndex, unsigned int, llvm::SlotIndex): Assertion `(!LeaveBefore || Idx <= LeaveBefore) && "Interference"' failed.

I'll prepare a reproducer.
llc -mtriple=amdgcn--amdpal -march=amdgcn -mcpu=gfx1030 repro.txt
Revert "[AMDGPU] Try to fix the block prologs broken by RA inserted instructions (llvm#69924)"
This reverts commit a0eb6b8. Caused CTS failures:
ubuntu_20-04_navi21_vk-llvm-test / CTS.dEQP-VK.ssbo.phys.layout.unsized_nested_struct_array.single_buffer.scalar_instance_array_comp_access
ubuntu_20-04_navi21_vk-llvm-test / CTS.dEQP-VK.ssbo.phys.layout.unsized_nested_struct_array.single_buffer.scalar_instance_array_comp_access_store_cols
ubuntu_22-04_navi31_vk-llvm-test / CTS.dEQP-VK.ssbo.phys.layout.unsized_nested_struct_array.single_buffer.scalar_instance_array_comp_access
ubuntu_22-04_navi31_vk-llvm-test / CTS.dEQP-VK.ssbo.phys.layout.unsized_nested_struct_array.single_buffer.scalar_instance_array_comp_access_store_cols
Change-Id: Id804440b442cae543c2dd0f5c2ba1bcb6b805d1e
Local branch amd-gfx d648e11 Revert "[AMDGPU] Try to fix the block prologs broken by RA inserted instructions (llvm#69924)"
Remote branch main 75b3c3d [ARM] Disable UpperBound loop unrolling for MVE tail predicated loops. (llvm#69709)
Change-Id: I5ed179024ddce969c97745bd3947ac42772629c0
Ping. Is this being looked into?
I'm looking into it. The assertion started firing because the insertion point is now different. SplitKit expects the COPY during a live-range split to be inserted at the top of the BB, but after my patch the insertion point varies and the copies might not always land before the first instruction. Either way, both greedy and fast regalloc have problems with the recomputed BB prolog. I'm trying to see if the insertion point can be retained for SGPR spills/copies as an immediate fixup. But isBasicBlockPrologue is a common interface used by InlineSpiller, SplitKit, and MachineSink to identify the insertion point, which makes the case difficult to handle. I will provide an update soon.
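For context, the generic consumer of this hook looks roughly like the sketch below, a paraphrase of MachineBasicBlock::SkipPHIsLabelsAndDebug (the in-tree version takes extra parameters and handles a few more cases, so treat this as an approximation). SplitKit and InlineSpiller use it to pick the insertion point at a block entry, so every instruction the target's isBasicBlockPrologue accepts gets skipped:

MachineBasicBlock::iterator
MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I) {
  const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
  iterator E = end();
  // PHIs, labels/positions, debug instructions, and target-defined
  // block-prologue instructions are all stepped over; the returned
  // iterator is where a split COPY or a reload would be inserted.
  while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
                    TII->isBasicBlockPrologue(*I)))
    ++I;
  return I;
}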
… instructions (llvm#69924)" This reverts commit d648e11. And unreverts commit a0eb6b8.
[AMDGPU] Try to fix the block prologs broken by RA inserted instructions (llvm#69924)
The insertion point determined by RA while attempting spills and live-range splits at the beginning of a block goes wrong at times, and the newly inserted vector instructions end up placed before the exec-mask restore instruction, which is incorrect. This occurs mainly because of the dependency on isBasicBlockPrologue, which doesn't account for instructions (spills and split copies) inserted early during RA, and so the block prolog breaks. A better approach for deciding the insertion point should be worked out. For now, improve the helper function to consider all possible early insertions. This patch adds the spill instructions; the copies associated with live-range splitting should also be included in the block prolog.
Change-Id: I84ce856add0028b5dadf5a518f750a35bb27ecf3
…gue (#109439) PRs #69924 and #72140 modified SIInstrInfo::isBasicBlockPrologue to skip over EXEC modifications and spills when allocating VGPRs. But treating VGPR spills as part of the prologue can confuse the register allocator as in #109294, so restrict it to SGPR spills, which were inserted during SGPR allocation which is done in an earlier pass. Fixes: #109294 Fixes: SWDEV-485841
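A hedged sketch of the narrowed predicate from #109439 (the exact in-tree form may differ slightly; isSGPRSpill is the existing TSFlags-based helper shown in the SIInstrInfo.h hunk above):

bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI) const {
  uint16_t Opcode = MI.getOpcode();
  // Only SGPR spills count as prologue instructions: they were inserted by
  // the earlier SGPR allocation pass. Treating VGPR spills as prologue
  // confused the VGPR allocator (see issue #109294).
  return !MI.isTerminator() && Opcode != AMDGPU::COPY &&
         (MI.modifiesRegister(AMDGPU::EXEC, &RI) || isSGPRSpill(Opcode));
}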
This function helps to skip over the basic block prolog instructions while inserting a new spill or a copy during live-range split. However, this appears to be incomplete. It currently skips only the instructions that restore the exec mask. It should also have skipped over the BB prolog spills and the live-range split copies inserted earlier during regalloc while dealing with the BB live-ins.

This patch partially addresses the limitation by including only the spills in the query. The live-range split COPY handling is pending.

This helper function is also used at multiple places to skip over the BB prolog. They should continue to produce functionally correct code even after this patch.

This patch depends on #69923
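The spill check referred to above is the isSpillOpcode helper added in SIInstrInfo.h (copied from the diff); it keys off TSFlags, so both SGPR and VGPR spill pseudos qualify:

bool isSpillOpcode(uint16_t Opcode) const {
  // An opcode counts as a spill if its TSFlags carry either spill bit. The
  // new SI_WRITELANE_PSEUDO/SI_READLANE_PSEUDO are marked SGPRSpill in
  // SIInstructions.td, so lowered SGPR spills are covered as well.
  return get(Opcode).TSFlags &
         (SIInstrFlags::SGPRSpill | SIInstrFlags::VGPRSpill);
}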