-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Fix lazy machine state unwinding for MSVC epilogues on x86 #45906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/coreclr/vm/i386/gmsx86.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "MOV ESP, EBP" correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I'll update the comment when I finish this patch. This still doesn't solve the whole issue, but I got roped into some other issues the past couple of weeks. Thanks for noting this.
I have hit this crash in #46244. I am fixing it by converting the FCall w/ HMF to QCall. Identifying all FCalls with this problem and converting them to QCalls is one potential way to deal with this problem. |
0e978df
to
c67bf99
Compare
@hoyosjs is this change finished or does it require more work? |
This hadn't fixed the debugger logging issue and i went on vacation. Jan fixed this in master by turning the FCall into a QCal. Did you find another case where this hit? |
Yes, there are several other FCalls that have this problem. jkotas@40af820 that I have shared with you earlier is a hack to identify them. |
I tested this again, and thought it makes sense and the first unwind and a manual walk through the yield the same ebp/esp/return address, somehow later walks end up in a bad walk - which made me wonder if it could have anything to do with tiering. Disabling tiering stops reproducing the issue (even in the absence of the fix, not sure what it changes in terms of places where this could happen). I'm going to port the debugger fix first as this seems to be getting long. |
Some manifestations of this bug will be masked when a method higher on the stack saves more non-volatile registers or uses different sequence to save them. It is likely why you see it no longer crash with tiering disabled. You should check the values of all non-volatile registers after the unwind, not just ebp/esp/return address. |
The repro I was using on top of 5.0 was going from main directly to the FCall: using System.Diagnostics;
class Program
{
public const int NUM_ITEMS = 5_000_000;
static void Main(string[] args)
{
for (int i = 0; i < NUM_ITEMS; i++)
{
string a = new string("Hello");
Debugger.Log(0, null, a);
}
}
} And on the first pass the registers in the lazy state (so |
Actually, this might have been correct all along. Having the native debugger attached with some breakpoint set might have been what tripped the FCALL. I let this run for a while under VS and see no crash, where as before it was in a sub-second manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I have double checked it under debugger and I have convinced myself that this works fine.
I think this change is low-risk enough to be ported to servicing instead of the QCall changes.
MSVC introduced changes into their prologue, epilogue helpers. Particularly
_EH_prolog3_catch_GS_align
/_EH_epilog3_GS_align
had amov esp, ebx
instruction the lazy unwinding didn't handle. This helper ended up getting called by 3 FCalls in all coreclr:DebugDebugger::Log
StubHelpers::ValidateObjec
COMDelegate::BindToMethodName
_EH_epilog3_align
has no direct calls for an FCall. All go though the GS variant first.As for the other reported reliability issue in
ThreadNative::GetThreadDeserializationTracker
, that one uses another helper_EH_prolog3_catch
/_EH_prolog3_catch_GS
. I hand checked the helper, and there's nothing I can see the unwinder wouldn't gracefully interpret. It's also unlikely these stubs would cause an issue just given the amount of FCalls they support (roughly half the CLR's FCalls use that variant).