Skip to content

arm neon ldrb before st2 error lead to compare fail #64696

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hstk30 opened this issue Aug 15, 2023 · 24 comments · Fixed by llvm/llvm-project-release-prs#679
Closed

arm neon ldrb before st2 error lead to compare fail #64696

hstk30 opened this issue Aug 15, 2023 · 24 comments · Fixed by llvm/llvm-project-release-prs#679

Comments

@hstk30
Copy link

hstk30 commented Aug 15, 2023

clang: https://godbolt.org/z/xW5s7s35q

gcc: https://godbolt.org/z/P5Wv7Y4zh

code

#include <arm_neon.h>

extern void abort (void);

int test_vst2_lane_u8 (const uint8_t *data) {
  uint8x8x2_t vectors;
  for (int i = 0; i < 2; i++, data += 8) {
    vectors.val[i] = vld1_u8 (data);
  }

  uint8_t temp[2];
  vst2_lane_u8 (temp, vectors, 6);

// printf("temp[0]: %d\n", temp[0]);
// printf("temp[1]: %d\n", temp[1]);
//printf("vectors.val[0][6]: %d\n", vget_lane_u8(vectors.val[0], 6));
//printf("vectors.val[1][6]: %d\n", vget_lane_u8(vectors.val[1], 6));

  for (int i = 0; i < 2; i++) {
    if (temp[i] != vget_lane_u8 (vectors.val[i], 6))   /* error */
      return 1;
  }
  return 0;
}

int main (int argc, char **argv)
{

  uint64_t orig_data[8] = {
    0x1234567890abcdefULL, 0x13579bdf02468aceULL,
  };

  if (test_vst2_lane_u8 ((const uint8_t *)orig_data))
    abort ();;

  return 0;
}

I see the asm

test_vst2_lane_u8:                      // @test_vst2_lane_u8
        sub     sp, sp, #16
        ldp     d0, d1, [x0]
        add     x8, sp, #12
        ldrb    w11, [sp, #13]               // temp[1]
        st2     { v0.b, v1.b }[6], [x8]      // vst2_lane_u8 (temp, vectors, 6);
        umov    w8, v0.b[6]                 
        ldrb    w9, [sp, #12]                 // temp[0]
        umov    w10, v1.b[6]
        cmp     w9, w8, uxtb
        ccmp    w11, w10, #0, eq
        cset    w0, ne
        add     sp, sp, #16
        ret

The error is caused because the temp[1] is loaded before st2, so the compare fail.

I switch it like :

        st2     { v0.b, v1.b }[6], [x8]      // vst2_lane_u8 (temp, vectors, 6);
        umov    w8, v0.b[6]
        ldrb    w11, [sp, #13]               // temp[1]                 
        ldrb    w9, [sp, #12]                 // temp[0]

and recompile it, it's OK.

Anyone have idea to fix it?

@llvmbot
Copy link
Member

llvmbot commented Aug 15, 2023

@llvm/issue-subscribers-backend-arm

@llvmbot
Copy link
Member

llvmbot commented Aug 15, 2023

@llvm/issue-subscribers-backend-aarch64

@efriedma-quic
Copy link
Collaborator

opt-bisect points to MachineScheduler? I suspect something is going wrong with alias analysis.

Changing the size of "temp" from 2 to 16 seems to work around the issue.

@hstk30
Copy link
Author

hstk30 commented Aug 15, 2023

opt-bisect points to MachineScheduler? I suspect something is going wrong with alias analysis.

Changing the size of "temp" from 2 to 16 seems to work around the issue.

Seem the size of temp > 12 is work, and <= 12 is fail.

Can you give some suggestions concretely? I'm new to llvm, I want do something, but not have direction.

@hstk30
Copy link
Author

hstk30 commented Aug 15, 2023

I location the pass which first lead to this problem

BISECT: NOT running pass (236) PostRA Machine Instruction Scheduler on function (test_vst2_lane_u8)

@davemgreen
Copy link
Collaborator

There is some code in AArch64TargetLowering::getTgtMemIntrinsic that tries to specify the info for what a target intrinsic will load/store (aarch64_neon_st2lane in this case). It looks like it is saying it is accessing a much larger vector than just 2 elements, and if that is larger than the underlying object then the aliasing analysis can start going wrong.

@hstk30
Copy link
Author

hstk30 commented Aug 16, 2023

AArch64 Indirect Thunks
Before post-MI-sched:
# Machine code for function test_vst2_lane_u8: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=2, align=4, at location [SP-4]
Function Live Ins: $x0

bb.0.entry:
  liveins: $x0
  $sp = frame-setup SUBXri $sp, 16, 0
  frame-setup CFI_INSTRUCTION def_cfa_offset <mcsymbol .Ltmp0>16
  $x8 = ADDXri $sp, 12, 0
  renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
  ST2i8 renamable $q0_q1, 6, killed renamable $x8 :: (store (s128) into %ir.temp)
  renamable $w8 = UMOVvi8 renamable $q0, 6
  renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  renamable $w10 = UMOVvi8 renamable $q1, 6, implicit killed $q0_q1
  renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  dead $wzr = SUBSWrx killed renamable $w9, killed renamable $w8, 0, implicit-def $nzcv
  CCMPWr killed renamable $w11, killed renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
  renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
  $sp = frame-destroy ADDXri $sp, 16, 0
  frame-destroy CFI_INSTRUCTION def_cfa_offset 0
  RET undef $lr, implicit $w0

# End machine code for function test_vst2_lane_u8.

********** MI Scheduling **********
test_vst2_lane_u8:%bb.0 entry
  From: $x8 = ADDXri $sp, 12, 0
    To: $sp = frame-destroy ADDXri $sp, 16, 0
 RegionInstrs: 10
ScheduleDAGMI::schedule starting
SU(0):   $x8 = ADDXri $sp, 12, 0
  # preds left       : 0
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 11
  Successors:
    SU(3): Out  Latency=1
    SU(2): Data Latency=3 Reg=$x8
SU(1):   renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
  # preds left       : 0
  # succs left       : 5
  # rdefs left       : 0
  Latency            : 5
  Depth              : 0
  Height             : 13
  Successors:
    SU(3): Data Latency=4 Reg=$q0
    SU(5): Data Latency=0 Reg=$q0_q1
    SU(2): Data Latency=5 Reg=$q0_q1
    SU(5): Data Latency=5 Reg=$q1
    SU(9): Anti Latency=0
SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
  # preds left       : 2
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 5
  Height             : 8
  Predecessors:
    SU(1): Data Latency=5 Reg=$q0_q1
    SU(0): Data Latency=3 Reg=$x8
  Successors:
    SU(3): Anti Latency=0
    SU(4): Ord  Latency=1 Memory
SU(3):   renamable $w8 = UMOVvi8 renamable $q0, 6
  # preds left       : 3
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 4
  Depth              : 5
  Height             : 8
  Predecessors:
    SU(2): Anti Latency=0
    SU(1): Data Latency=4 Reg=$q0
    SU(0): Out  Latency=1
  Successors:
    SU(7): Data Latency=4 Reg=$w8
SU(4):   renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  # preds left       : 1
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 6
  Height             : 7
  Predecessors:
    SU(2): Ord  Latency=1 Memory
  Successors:
    SU(7): Data Latency=3 Reg=$w9
SU(5):   renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
  # preds left       : 2
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 4
  Depth              : 5
  Height             : 7
  Predecessors:
    SU(1): Data Latency=0 Reg=$q0_q1
    SU(1): Data Latency=5 Reg=$q1
  Successors:
    SU(8): Data Latency=4 Reg=$w10
SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 6
  Successors:
    SU(8): Data Latency=3 Reg=$w11
SU(7):   dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
  # preds left       : 2
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 3
  Depth              : 9
  Height             : 4
  Predecessors:
    SU(4): Data Latency=3 Reg=$w9
    SU(3): Data Latency=4 Reg=$w8
  Successors:
    SU(8): Out  Latency=1
    SU(8): Data Latency=1 Reg=$nzcv
SU(8):   CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
  # preds left       : 4
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 10
  Height             : 3
  Predecessors:
    SU(7): Out  Latency=1
    SU(7): Data Latency=1 Reg=$nzcv
    SU(6): Data Latency=3 Reg=$w11
    SU(5): Data Latency=4 Reg=$w10
  Successors:
    SU(9): Data Latency=3 Reg=$nzcv
SU(9):   renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
  # preds left       : 2
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 3
  Depth              : 13
  Height             : 0
  Predecessors:
    SU(8): Data Latency=3 Reg=$nzcv
    SU(1): Anti Latency=0
ExitSU:   $sp = frame-destroy ADDXri $sp, 16, 0
  # preds left       : 0
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 0
  Depth              : 0
  Height             : 0
Disabled scoreboard hazard recognizer
Critical Path: (PGS-RR) 13
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 0 1 6 
  TopQ.A RemainingLatency 0 + 0c > CritPath 13
  Cand SU(0) ORDER                              
  Cand SU(1) TOP-PATH                  13 cycles 
Pick Top TOP-PATH  
Scheduling SU(1) renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
  Ready @0c
  CortexA55UnitLd +2x2u
  *** Critical resource CortexA55UnitLd: 2c
  TopQ.A BotLatency SU(1) 13c
  *** Max MOps 2 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
  Retired: 2
  Executed: 2c
  Critical: 2c, 2 CortexA55UnitLd
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
  SU(6) CortexA55UnitLd[0]=2c
Queue TopQ.P: 5 6 
Queue TopQ.A: 0 
Pick Top ONLY1     
Scheduling SU(0) $x8 = ADDXri $sp, 12, 0
  Ready @1c
  CortexA55UnitALU +1x1u
TopQ.A @1c
  Retired: 3
  Executed: 2c
  Critical: 2c, 2 CortexA55UnitLd
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 2 TopQ.A
Queue TopQ.P: 5 2 
Queue TopQ.A: 6 
Pick Top ONLY1     
Scheduling SU(6) renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  Ready @2c
  CortexA55UnitLd +1x2u
TopQ.A @2c
  Retired: 4
  Executed: 3c
  Critical: 3c, 3 CortexA55UnitLd
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 3 TopQ.A
Cycle: 5 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 5 2 
  TopQ.A RemainingLatency 0 + 5c > CritPath 13
  Latency limited both directions.
  Cand SU(5) ORDER                              
  Cand SU(2) TOP-PATH                  8 cycles 
Pick Top TOP-PATH  
Scheduling SU(2) ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
  Ready @5c
  CortexA55UnitSt +2x2u
  TopQ.A TopLatency SU(2) 5c
TopQ.A @5c
  Retired: 5
  Executed: 5c
  Critical: 3c, 3 CortexA55UnitLd
  ExpectedLatency: 5c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 4 
Queue TopQ.A: 5 3 
  TopQ.A RemainingLatency 0 + 5c > CritPath 13
  Latency limited both directions.
  Cand SU(5) ORDER                              
  Cand SU(3) TOP-PATH                  8 cycles 
Pick Top TOP-PATH  
Scheduling SU(3) renamable $w8 = UMOVvi8 renamable $q0, 6
  Ready @5c
  CortexA55UnitFPALU +1x1u
  *** Max MOps 2 at cycle 5
Cycle: 6 TopQ.A
TopQ.A @6c
  Retired: 6
  Executed: 6c
  Critical: 3c, 3 CortexA55UnitLd
  ExpectedLatency: 5c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 5 4 
  TopQ.A RemainingLatency 0 + 6c > CritPath 13
  Latency limited both directions.
  Cand SU(5) ORDER                              
  Cand SU(4) ORDER                              
Pick Top ORDER     
Scheduling SU(4) renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  Ready @6c
  CortexA55UnitLd +1x2u
  TopQ.A TopLatency SU(4) 6c
TopQ.A @6c
  Retired: 7
  Executed: 6c
  Critical: 4c, 4 CortexA55UnitLd
  ExpectedLatency: 6c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 7 
Queue TopQ.A: 5 
Pick Top ONLY1     
Scheduling SU(5) renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
  Ready @6c
  CortexA55UnitFPALU +1x1u
  *** Max MOps 2 at cycle 6
Cycle: 7 TopQ.A
TopQ.A @7c
  Retired: 8
  Executed: 7c
  Critical: 4c, 4 CortexA55UnitLd
  ExpectedLatency: 6c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 9 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 7 
Pick Top ONLY1     
Scheduling SU(7) dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
  Ready @9c
  CortexA55UnitALU +1x1u
  TopQ.A TopLatency SU(7) 9c
TopQ.A @9c
  Retired: 9
  Executed: 9c
  Critical: 4c, 4 CortexA55UnitLd
  ExpectedLatency: 9c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 10 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 8 
Pick Top ONLY1     
Scheduling SU(8) CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
  Ready @10c
  *** Critical resource NumMicroOps: 5c
  CortexA55UnitALU +1x1u
  TopQ.A TopLatency SU(8) 10c
TopQ.A @10c
  Retired: 10
  Executed: 10c
  Critical: 5c, 10 MOps
  ExpectedLatency: 10c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 11 TopQ.A
Cycle: 13 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 9 
Pick Top ONLY1     
Scheduling SU(9) renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
  Ready @13c
  CortexA55UnitALU +1x1u
  TopQ.A TopLatency SU(9) 13c
TopQ.A @13c
  Retired: 11
  Executed: 13c
  Critical: 5c, 11 MOps
  ExpectedLatency: 13c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
*** Final schedule for %bb.0 ***
SU(1):   renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
SU(0):   $x8 = ADDXri $sp, 12, 0
SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
SU(3):   renamable $w8 = UMOVvi8 renamable $q0, 6
SU(4):   renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
SU(5):   renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
SU(7):   dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
SU(8):   CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
SU(9):   renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv

Fixup kills for %bb.0

gmi change from

  ST2i8 renamable $q0_q1, 6, killed renamable $x8 :: (store (s128) into %ir.temp)
  renamable $w8 = UMOVvi8 renamable $q0, 6
  renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  renamable $w10 = UMOVvi8 renamable $q1, 6, implicit killed $q0_q1
  renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)

to

SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
SU(3):   renamable $w8 = UMOVvi8 renamable $q0, 6
SU(4):   renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
SU(5):   renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1

in the code

https://github.com/llvm/llvm-project/blob/c5f763b563e37ebe26bfd4a012269482d54d0a80/llvm/lib/CodeGen/MachineScheduler.cpp#L627C1-L629C28

@hstk30
Copy link
Author

hstk30 commented Aug 17, 2023

According to the debug info:

SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
  # preds left       : 2
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 5
  Height             : 8
  Predecessors:
    SU(1): Data Latency=5 Reg=$q0_q1
    SU(0): Data Latency=3 Reg=$x8
  Successors:
    SU(3): Anti Latency=0
    SU(4): Ord  Latency=1 Memory
...

SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 6
  Successors:
    SU(8): Data Latency=3 Reg=$w11

SU(2) should have an another Successors SU(6) like SU(4). But not.

@hstk30
Copy link
Author

hstk30 commented Aug 17, 2023

If alias analysis goes wrong, how can I debug alias analysis? I have no idea. 😬

@efriedma-quic
Copy link
Collaborator

The scheduler ends up calling MachineInstr::mayAlias, I think, to determine what edges are necessary.

@hstk30
Copy link
Author

hstk30 commented Aug 18, 2023

Yes, alias analysis goes wrong in this function.

https://github.com/llvm/llvm-project/blob/0816b3efbfaaf958a3f2e842aa3eacd525e7ae12/llvm/lib/CodeGen/MachineInstr.cpp#L1282C1-L1347C2

for temp[2] it return false, for temp[16] it return true in the last return line.
https://github.com/llvm/llvm-project/blob/0816b3efbfaaf958a3f2e842aa3eacd525e7ae12/llvm/lib/CodeGen/MachineInstr.cpp#L1343C1-L1346C66

So have good guide to debug alias analysis? 🤯

@hstk30
Copy link
Author

hstk30 commented Aug 18, 2023

Disable basic aa is right.

https://godbolt.org/z/rnoThYxKq

Below is aa-trace info about ST2i8 :

ST2i8 %7:qq, 6, %8:gpr64sp :: (store (s128) into %ir.temp)

Start   %temp = alloca [2 x i8], align 4 @ LocationSize::precise(16),   %arrayidx11.1 = getelementptr inbounds [2 x i8], ptr %temp, i64 0, i64 1 @ LocationSize::precise(1)
End   %temp = alloca [2 x i8], align 4 @ LocationSize::precise(16),   %arrayidx11.1 = getelementptr inbounds [2 x i8], ptr %temp, i64 0, i64 1 @ LocationSize::precise(1) = NoAlias
Start   %temp = alloca [16 x i8], align 4 @ LocationSize::precise(16),   %arrayidx11.1 = getelementptr inbounds [16 x i8], ptr %temp, i64 0, i64 1 @ LocationSize::precise(1)
  Start   %temp = alloca [16 x i8], align 4 @ LocationSize::beforeOrAfterPointer,   %temp = alloca [16 x i8], align 4 @ LocationSize::beforeOrAfterPointer
  End   %temp = alloca [16 x i8], align 4 @ LocationSize::beforeOrAfterPointer,   %temp = alloca [16 x i8], align 4 @ LocationSize::beforeOrAfterPointer = MustAlias
End   %temp = alloca [16 x i8], align 4 @ LocationSize::precise(16),   %arrayidx11.1 = getelementptr inbounds [16 x i8], ptr %temp, i64 0, i64 1 @ LocationSize::precise(1) = PartialAlias (off 1)

the diff of 2 and 16 is the size of temp (size of 2 is fail, size of 16 is work)

@hstk30
Copy link
Author

hstk30 commented Aug 18, 2023

https://github.com/llvm/llvm-project/blob/673ef8ceaece6c9a7194474ef7d97b3b240c0dc5/llvm/lib/Analysis/BasicAliasAnalysis.cpp#L117C1-L158C2

Ahead, I think this case is hit the c3 context in function isObjectSmallerThan .

https://godbolt.org/z/xW5s7s35q

In the size of temp 2 case, the getObjectSize return 4 because temp is stored in sp + 12, so the object size is 4 ? But the Size return by getMinimalExtentFrom is 16.

https://github.com/llvm/llvm-project/blob/673ef8ceaece6c9a7194474ef7d97b3b240c0dc5/llvm/lib/Analysis/BasicAliasAnalysis.cpp#L160C1-L180C2

@efriedma-quic
Copy link
Collaborator

Right, that's basically the same conclusion as #64696 (comment) : the memory operand says the store writes to 16 bytes, but the object in question is only 4 bytes, so alias analysis concludes the store can't access the object.

@hstk30
Copy link
Author

hstk30 commented Aug 19, 2023

So, vst3_lane_u8 also have this problem, vst4_lane_u8 too.

https://godbolt.org/z/4sfT3rzWr

Thx your patient @efriedma-quic .

So, What can we do to fix this problem? @davemgreen @efriedma-quic

@davemgreen
Copy link
Collaborator

Are you willing to put together a patch to AArch64TargetLowering::getTgtMemIntrinsic that gets the memVT more correct for aarch64_neon_st2 (and st3/st4, and maybe the loads too)? With a testcase it sounds like a sensible fix.

@hstk30
Copy link
Author

hstk30 commented Aug 19, 2023

I want to try it, but I'm not so familiar with llvm. Can you give me some suggestions?
Let me try.

@davemgreen
Copy link
Collaborator

There are some details on how to write patches and contribute in https://llvm.org/docs/Contributing.html if it is helpful, and some extra details on phabricator in https://llvm.org/docs/Phabricator.html#phabricator-reviews.

@hstk30
Copy link
Author

hstk30 commented Aug 23, 2023

@davemgreen Hi, can you review it? https://reviews.llvm.org/D158611

hstk30 added a commit to hstk30/llvm-project that referenced this issue Aug 31, 2023
stx lane memory size set too big lead to alias analysis goes wrong.
llvm#64696
@davemgreen davemgreen reopened this Aug 31, 2023
@davemgreen davemgreen added this to the LLVM 17.0.X Release milestone Aug 31, 2023
@github-project-automation github-project-automation bot moved this to Needs Triage in LLVM Release Status Aug 31, 2023
@davemgreen
Copy link
Collaborator

/ cherry-pick db8f6c0

@davemgreen
Copy link
Collaborator

/cherry-pick db8f6c0

@llvmbot
Copy link
Member

llvmbot commented Sep 1, 2023

/branch llvm/llvm-project-release-prs/issue64696

llvmbot pushed a commit to llvm/llvm-project-release-prs that referenced this issue Sep 1, 2023
StN lane memory size set too big lead to alias analysis goes wrong.

Fixes llvm/llvm-project#64696

Differential Revision: https://reviews.llvm.org/D158611

(cherry picked from commit db8f6c009e5a17d304be7404e50eb20b2dd0c75b)
@llvmbot
Copy link
Member

llvmbot commented Sep 1, 2023

/pull-request llvm/llvm-project-release-prs#679

@hstk30 hstk30 closed this as completed Sep 1, 2023
@EugeneZelenko
Copy link
Contributor

Not merged yet.

@EugeneZelenko EugeneZelenko reopened this Sep 1, 2023
@tru tru moved this from Needs Triage to Needs Review in LLVM Release Status Sep 2, 2023
tru pushed a commit to llvm/llvm-project-release-prs that referenced this issue Sep 4, 2023
StN lane memory size set too big lead to alias analysis goes wrong.

Fixes llvm/llvm-project#64696

Differential Revision: https://reviews.llvm.org/D158611

(cherry picked from commit db8f6c009e5a17d304be7404e50eb20b2dd0c75b)
@tru tru moved this from Needs Review to Done in LLVM Release Status Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

5 participants