Skip to content

x86 jitstress / gcstress failure with methods that do not return #10085

@AndyAyersMS

Description

@AndyAyersMS

Finally think I have gotten to the root cause of the raft of failures seen in x86 jitstress=2 gcstress=0xC zapdisable=1 runs discussed over in dotnet/coreclr#17330. The case I have been looking at is baseservices\threading\waithandle\waitall\waitallex3 -- the other waitall tests should fail for similar reasons.

The key method here is System.Threading.WaitHandle:WaitAll(ref,int,bool):bool. This method is normally prejitted but with zapdisable=1 it gets jitted during this run. Jitstress kicks in and via STRESS_BB_PROFILE ends up altering the block layout. So we end up not moving BBJ_THROW blocks to the end of the method.

That leaves us with this sequence of generated code:

07150225 8b4304          mov     eax,dword ptr [ebx+4]      << -- ESI dead here per gc info
07150228 0580000000      add     eax,80h
0715022d 3bc6            cmp     eax,esi
0715022f 7ec3            jle     System_Private_CoreLib!System.Threading.WaitHandle.WaitAll(System.Threading.WaitHandle[], Int32, Boolean)+0x174 (071501f4)
;; jit knows this call does not return
07150231 ff151c1e4205    call    dword ptr ds:[5421E1Ch] (System.Threading.WaitHandle.ThrowAbandonedMutexException(), mdToken: 06002692)
;; join
07150237 8945f0          mov     dword ptr [ebp-10h],eax            <<< ESI live here
0715023a 8b548608        mov     edx,dword ptr [esi+eax*4+8]

This method also has fully interruptible GC reporting. The jit knows ThrowAbandonedMutexException does not return and reflects that in its liveness calculations. So ESI is not live at the call but becomes live right after the call.

During the run at the point of failure, this method has just called ThrowAbandonedMutexException
GC stress kicks off a GC in the prestub. The VM walks the stack for GC and reports that the EIP for this frame is 07150237. When enumerating GC roots, it reports ESI as live. But when in the callee, ESI is not in fact live and contains the scalar value 0x80. So the reporting triggers an assert for an invalid objref.

The logic in EECodeManager::EnumGcRefs knows that for stack locations liveness can change across a call, but it (implicitly?) assumes that for callee-saves registers that liveness before/after the call should be the same. So when reporting it looks at the after call liveness state.

Possible fixes here include:

  • modify EnumGcRefs to make the same curOffs adjustment for non-active fully interruptible callee save registers that it does for stack locations, to use their "in call" liveness state.
  • modify the jit to pad after normal user calls that it knows do not return so that the before and after liveness state of callee saves registers is the same. I have this prototyped and suspect it will be pretty surgical.

There are still a few loose ends to chase down:

  • why we didn't notice this long ago, as the does not return logic has been in the jit a while
  • why this only impacts x86
  • whether or not this situation can arise without jitstress

Metadata

Metadata

Assignees

Labels

GCStressJitStressCLR JIT issues involving JIT internal stress modesarch-x86area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions