Skip to content

Conversation

hoyosjs
Copy link
Member

@hoyosjs hoyosjs commented Dec 10, 2020

MSVC introduced changes into their prologue, epilogue helpers. Particularly _EH_prolog3_catch_GS_align/_EH_epilog3_GS_align had a mov esp, ebx instruction the lazy unwinding didn't handle. This helper ended up getting called by 3 FCalls in all coreclr:

  • DebugDebugger::Log
  • StubHelpers::ValidateObjec
  • COMDelegate::BindToMethodName

_EH_epilog3_align has no direct calls for an FCall. All go though the GS variant first.

As for the other reported reliability issue in ThreadNative::GetThreadDeserializationTracker, that one uses another helper _EH_prolog3_catch/_EH_prolog3_catch_GS. I hand checked the helper, and there's nothing I can see the unwinder wouldn't gracefully interpret. It's also unlikely these stubs would cause an issue just given the amount of FCalls they support (roughly half the CLR's FCalls use that variant).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "MOV ESP, EBP" correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I'll update the comment when I finish this patch. This still doesn't solve the whole issue, but I got roped into some other issues the past couple of weeks. Thanks for noting this.

@jkotas
Copy link
Member

jkotas commented Dec 22, 2020

I have hit this crash in #46244. I am fixing it by converting the FCall w/ HMF to QCall. Identifying all FCalls with this problem and converting them to QCalls is one potential way to deal with this problem.

@hoyosjs hoyosjs marked this pull request as ready for review January 6, 2021 01:26
@hoyosjs hoyosjs force-pushed the juhoyosa/fix-stackwalk-x86 branch from 0e978df to c67bf99 Compare January 6, 2021 02:41
@janvorli
Copy link
Member

janvorli commented Feb 8, 2021

@hoyosjs is this change finished or does it require more work?

@hoyosjs
Copy link
Member Author

hoyosjs commented Feb 8, 2021

This hadn't fixed the debugger logging issue and i went on vacation. Jan fixed this in master by turning the FCall into a QCal. Did you find another case where this hit?

@jkotas
Copy link
Member

jkotas commented Feb 8, 2021

Yes, there are several other FCalls that have this problem. jkotas@40af820 that I have shared with you earlier is a hack to identify them.

@hoyosjs hoyosjs closed this Feb 11, 2021
@hoyosjs hoyosjs reopened this Feb 11, 2021
@hoyosjs
Copy link
Member Author

hoyosjs commented Feb 11, 2021

I tested this again, and thought it makes sense and the first unwind and a manual walk through the yield the same ebp/esp/return address, somehow later walks end up in a bad walk - which made me wonder if it could have anything to do with tiering. Disabling tiering stops reproducing the issue (even in the absence of the fix, not sure what it changes in terms of places where this could happen). I'm going to port the debugger fix first as this seems to be getting long.

@jkotas
Copy link
Member

jkotas commented Feb 11, 2021

Some manifestations of this bug will be masked when a method higher on the stack saves more non-volatile registers or uses different sequence to save them. It is likely why you see it no longer crash with tiering disabled.

You should check the values of all non-volatile registers after the unwind, not just ebp/esp/return address.

@hoyosjs
Copy link
Member Author

hoyosjs commented Feb 11, 2021

The repro I was using on top of 5.0 was going from main directly to the FCall:

using System.Diagnostics;

class Program
{
    public const int NUM_ITEMS = 5_000_000;

    static void Main(string[] args)
    {
        for (int i = 0; i < NUM_ITEMS; i++)
        {
            string a = new string("Hello");
            Debugger.Log(0, null, a);
        }
    }
}

And on the first pass the registers in the lazy state (so edi, esi, ebx) also matched.

@hoyosjs hoyosjs closed this Feb 13, 2021
@hoyosjs hoyosjs reopened this Feb 13, 2021
@hoyosjs
Copy link
Member Author

hoyosjs commented Feb 13, 2021

Actually, this might have been correct all along. Having the native debugger attached with some breakpoint set might have been what tripped the FCALL. I let this run for a while under VS and see no crash, where as before it was in a sub-second manner.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I have double checked it under debugger and I have convinced myself that this works fine.

I think this change is low-risk enough to be ported to servicing instead of the QCall changes.

@jkotas jkotas merged commit 9c5d363 into dotnet:master Feb 17, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2021
@hoyosjs hoyosjs deleted the juhoyosa/fix-stackwalk-x86 branch September 10, 2021 07:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
3 participants