Stackwalk support for interpreted frames #113900

janvorli · 2025-03-25T21:40:08Z

This change adds support for walking the stack with interpreter frames present. The StackFrameIterator now treats interpreter frames as frameless frames, like regular JITted / crossgened code. This is the basis for GC, EH and debugger stack walking. With this change, the SOS clrstack and clrstack -i commands both display a correct stack trace when interpreter frames are on the stack.

It required changing GetCallerSP and EnsureCallerContextIsValid into virtual methods so that the right variant for the interpreter or for JITted / AOTed code can be invoked. Before, they were both static methods on the EECodeManager.
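A minimal sketch of why virtual dispatch helps here, assuming heavily simplified stand-in types. `RegDisplay`, `ICodeManager`, `NativeCodeManager`, `InterpreterCodeManager` and `CallerSpFor` below are hypothetical illustrations, not the runtime's actual declarations, and the way each variant computes the caller SP is invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified register display: not the runtime's REGDISPLAY.
struct RegDisplay {
    std::uintptr_t sp = 0;            // stack pointer of the current frame
    std::uintptr_t callerSp = 0;      // caller's stack pointer, once computed
    bool callerContextValid = false;  // true after the caller context is built
};

// Each kind of code supplies its own way to build the caller context.
class ICodeManager {
public:
    virtual ~ICodeManager() = default;
    virtual void EnsureCallerContextIsValid(RegDisplay* rd) = 0;

    // A virtual method (rather than a static one) lets the stack walker
    // reach the right variant through a base-class reference.
    virtual std::uintptr_t GetCallerSp(RegDisplay* rd) {
        assert(rd != nullptr);
        EnsureCallerContextIsValid(rd);
        return rd->callerSp;
    }
};

// Illustrative JITted / AOTed variant: caller SP from a fixed frame size.
class NativeCodeManager final : public ICodeManager {
public:
    explicit NativeCodeManager(std::uintptr_t frameSize) : m_frameSize(frameSize) {}
    void EnsureCallerContextIsValid(RegDisplay* rd) override {
        if (!rd->callerContextValid) {
            rd->callerSp = rd->sp + m_frameSize;
            rd->callerContextValid = true;
        }
    }
private:
    std::uintptr_t m_frameSize;
};

// Illustrative interpreter variant: caller SP comes from saved frame data.
class InterpreterCodeManager final : public ICodeManager {
public:
    explicit InterpreterCodeManager(std::uintptr_t savedCallerSp)
        : m_savedCallerSp(savedCallerSp) {}
    void EnsureCallerContextIsValid(RegDisplay* rd) override {
        if (!rd->callerContextValid) {
            rd->callerSp = m_savedCallerSp;
            rd->callerContextValid = true;
        }
    }
private:
    std::uintptr_t m_savedCallerSp;
};

// The walker only sees the interface; the code manager decides the rules.
inline std::uintptr_t CallerSpFor(ICodeManager& cm, RegDisplay* rd) {
    return cm.GetCallerSp(rd);
}
```

With static methods, the walker would have to branch on the kind of code at every call site; with the virtual form, the code manager associated with the current frame carries that decision.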

I have also added some extra handling for the case when the interpreter frames are on top of the stack, so that clrstack can dump them.

Copilot

Copilot reviewed 1 out of 20 changed files in this pull request and generated no comments.

Files not reviewed (19)

src/coreclr/debug/daccess/dacdbiimplstackwalk.cpp: Language not supported
src/coreclr/debug/daccess/stack.cpp: Language not supported
src/coreclr/debug/inc/dacdbistructures.inl: Language not supported
src/coreclr/inc/eetwain.h: Language not supported
src/coreclr/vm/FrameTypes.h: Language not supported
src/coreclr/vm/codeman.cpp: Language not supported
src/coreclr/vm/codeman.h: Language not supported
src/coreclr/vm/eetwain.cpp: Language not supported
src/coreclr/vm/eventtrace.cpp: Language not supported
src/coreclr/vm/exceptionhandling.cpp: Language not supported
src/coreclr/vm/frames.cpp: Language not supported
src/coreclr/vm/frames.h: Language not supported
src/coreclr/vm/gcinfodecoder.cpp: Language not supported
src/coreclr/vm/interpexec.cpp: Language not supported
src/coreclr/vm/interpexec.h: Language not supported
src/coreclr/vm/jitinterface.h: Language not supported
src/coreclr/vm/prestub.cpp: Language not supported
src/coreclr/vm/stackwalk.cpp: Language not supported
src/coreclr/vm/threadsuspend.cpp: Language not supported

dotnet-policy-service · 2025-03-25T21:40:51Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

janvorli · 2025-03-25T21:41:56Z

cc: @BrzVlad, @cshung

src/coreclr/vm/exceptionhandling.cpp

src/coreclr/debug/daccess/dacdbiimplstackwalk.cpp

jkotas · 2025-03-25T22:29:18Z

src/coreclr/vm/FrameTypes.h

 FRAME_TYPE_NAME(DebuggerU2MCatchHandlerFrame)
 FRAME_TYPE_NAME(ExceptionFilterFrame)
+#ifdef FEATURE_INTERPRETER
+FRAME_TYPE_NAME(InterpreterEntryFrame)


Do we really need to have both entry and exit frames? I would expect that one InterpreterFrame should be enough. We will know that we have exited the interpreter if we find some other Frame or managed code first on the stack.

I was considering that, and this seems to work better with the stack walker. We could possibly have just the InterpreterEntryFrame, but then upon reaching it we would immediately need to switch to the underlying interpreter frames and, once we get out of them, hit the InterpreterEntryFrame again and copy the transition record to the context to move to the caller of the sequence of interpreted frames. Also, the InterpreterEntryFrame doesn't have a direct pointer to the topmost frame in the InterpExecMethod; Vlad was concerned about updating that on every interpreter-to-interpreter call.

Are the interpreter -> AOT code calls going to push InterpreterExitFrame?

I think we should be optimizing for making the interpreter -> AOT calls as cheap as possible.

Yes, those would need it too. We can pass current thread as a parameter to the InterpMethodFrame and then use it to push that frame without the thread local access overhead.

So the purpose of the InterpreterEntryFrame is to store the context so that we can unwind to AOT callers, while the purpose of the InterpreterExitFrame is to make it cleaner to access the topmost interpreter frame? Would we always resume execution in the interpreter loop via a try/catch? So if EH determines that an interpreter method caught the exception, then in order to resume in the interpreter it would throw a C++ exception that gets caught at this interpreter exit location?

jkotas · 2025-03-25T22:35:12Z

src/coreclr/vm/frames.cpp

+    PTR_InterpMethodContextFrame pFrame = m_pInterpMethodContextFrame;
+    _ASSERTE(pFrame != NULL && pFrame->ip != NULL);
+
+    // The frames of a method are linked in a reverse order (from bottom to top of the part of the stack)


Is there an advantage in this reverse order? I would expect the interpreter frame to point to what we are executing.

Vlad was concerned about updating such a pointer on every interpreter to interpreter call. The current call implementation is pretty efficient and according to his experience from Mono, this would add noticeable overhead.

Moreover, this reverse walk is only needed in the native-debugger-on-desktop scenario, when someone has the debugger sitting in the middle of the InterpExecMethod. Nothing else needs to walk that way (provided we have the two interpreter frame kinds).

> Vlad was concerned about updating such a pointer on every interpreter to interpreter call.

We do not need to update the pointer on every interpreter-to-interpreter call. We only need to update the pointer when we exit the interpreter, i.e., instead of pushing the full Frame to exit the interpreter, we would just need to patch the current IP or current method in the Interpreter frame.

Only the native debugger needs to walk this list, and it would need to walk it even if I changed it the way you suggested, in case it was sitting in the InterpExecMethod when someone invoked the clrstack command. So storing it doesn't seem really beneficial, unless we moved to the single interpreter frame approach, which would introduce the unusual double handling of a single explicit frame in the StackFrameIterator, or something similar to that.
What is your concern about having two separate explicit frames?

I think the scheme looks more expensive than it needs to be ~~and non-intuitive (the linked list is in the wrong direction)~~. EDIT: I see that it is a doubly linked list, so it has links in the right direction too.

Yes, the direction I am using for this specific native debugger scenario is primarily used for the interpreter internal purposes, I have just reused it for these rare scenarios. This is not used for managed debuggers, GC or EH. In those cases we would always have the interpreter exit frame.

src/coreclr/vm/prestub.cpp

janvorli · 2025-03-29T00:20:16Z

@jkotas I have made changes based on your feedback. There is now a single InterpreterFrame, and it keeps a pointer to the top frame of the related InterpExecMethod. That pointer is updated whenever the InterpExecMethod calls something external where the runtime can be suspended. To make it work with SOS when debugging dumps, or when live debugging while the interpreter is running inside the InterpExecMethod without having left it, we still search for the real top starting at the value that's stored there. In the EH, GC and managed debugger cases, the pointer will always be the right one.
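The single-frame scheme described above can be sketched roughly as follows. This is a simplified illustration only: `MethodContextFrame`, `InterpFrame`, `EnterInterpretedCallee` and `CallExternal` are hypothetical names standing in for the runtime's types and call paths, and the real frame layout differs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a per-method interpreter frame.
struct MethodContextFrame {
    MethodContextFrame* parent = nullptr;  // frame of the interpreted caller
    const std::uint8_t* ip = nullptr;      // current bytecode position
};

// Hypothetical stand-in for the single explicit frame of one interpreter run.
class InterpFrame {
public:
    explicit InterpFrame(MethodContextFrame* first) : m_top(first) {}
    void SetTop(MethodContextFrame* frame) { m_top = frame; }
    MethodContextFrame* GetTop() const { return m_top; }
private:
    MethodContextFrame* m_top;  // last published top interpreter frame
};

// Interpreter-to-interpreter call: links the callee frame and nothing else.
// The explicit frame is deliberately left untouched on this hot path.
inline void EnterInterpretedCallee(MethodContextFrame* caller,
                                   MethodContextFrame* callee,
                                   const std::uint8_t* calleeIp) {
    assert(caller != nullptr && callee != nullptr);
    callee->parent = caller;
    callee->ip = calleeIp;
}

// Call that leaves the interpreter: one store publishes the current frame
// before running code during which the runtime could be suspended.
template <typename Fn>
void CallExternal(InterpFrame* frame, MethodContextFrame* current, Fn&& external) {
    assert(frame != nullptr && current != nullptr);
    frame->SetTop(current);
    external();
}
```

In this sketch a stack walk that starts while the external code is running reads the exact top frame from `GetTop()`, whereas a native debugger that stops in the middle of interpreted execution may see a stale value and has to search from it.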

janvorli · 2025-03-29T00:21:46Z

The CI test failures are unrelated.

src/coreclr/inc/eetwain.h

src/coreclr/vm/gcinfodecoder.cpp

jkotas · 2025-03-30T01:22:13Z

src/coreclr/vm/prestub.cpp

-    InterpExecMethod(&interpFrame, threadContext);
+    InterpreterFrame interpreterFrame(pTransitionBlock, &interpMethodContextFrame);
+
+    InterpExecMethod(&interpreterFrame, &interpMethodContextFrame, threadContext);


This change looks ok for now. Do we have a plan for where and how the arguments are going to be converted from the TransitionBlock to the interpreter representation?

I'm thinking when invoking an interpreted method through the precode thunks, instead of calling ExecuteInterpretedMethod like we do now, it will call some other assembly thunk that receives a lower level bytecode referenced from the interpreter bytecode (maybe in some Interpreter code header). These assembly thunks will move arguments from the transition block to the interpreter stack, finally dispatching to ExecuteInterpreterMethod. I described this mechanism a while ago in a doc. Last time I spoke with @janvorli, the plan was to use this mechanism from the start since we will need it as a fallback anyway on iOS.

I've used the TransitionBlock here for two reasons. First, as a possibly easy starting point for passing arguments until we implement the proper mechanism. Second, as a means to save and restore the callee-saved registers. This is necessary as we need to be able to restore them during stack walking from interpreted to JITted / AOTed methods. Once we have the optimized mechanism, we can use a smaller structure than the TransitionBlock that would contain only the callee-saved registers.

BrzVlad · 2025-03-31T18:55:34Z

src/coreclr/vm/frames.cpp

+        }
+    }
+    else
+    {


~~The iteration here seems weird. I don't see why we would ever reach this path, given we always start from the "list head". Why would we ever need to iterate backwards?~~ Nvm, I noticed this is actually the top and we set it when leaving the interpreter.

jkotas · 2025-03-31T22:50:11Z

src/coreclr/vm/interpexec.cpp

                    MethodDesc *pMD = (MethodDesc*)(targetMethod & ~INTERP_METHOD_DESC_TAG);
                    PCODE code = pMD->GetNativeCode();
                    if (!code) {
+                        pInterpreterFrame->SetTopInterpMethodContextFrame(pFrame);


If I understand this correctly, this is just an optimization and not required for correctness. Is that correct? It may be worth mentioning it in a comment.

Does the setting need to be undone when the call returns?

It is not clear to me whether this optimization is worth it. It may be better to do it during stackwalk - so that repeated stackwalks from approximately the same spot avoid walking the whole list.

Yes, it would still work correctly without it, so it can be viewed as an optimization.

It doesn't need to be undone when the call returns:

* For the SOS scenario, we would seek the top frame by going one way or the other through the linked list from the last value stored there. It doesn't matter if we start on a frame that's in the middle of the "substack" belonging to the current InterpExecMethod or if we start on an inactive frame that's somewhere by the end of the list. Based on the inactive vs active state of the frame (ip being 0 or non-zero), we know which way to seek for the last active frame.
* For other scenarios, we would maintain the exact value so that we don't waste time scanning the list.

Since this optimization is just one store, I think it is worth doing that instead of updating it during the stack walk. Imagine a case with a deep recursion within interpreted frames only. It would cause a lot of seeking the first time we do a stack walk which is completely unnecessary.
We will need to do other stuff while leaving the interpreter, like catching exceptions from the native code and calling the DispatchManagedException, so this is going to be just one of the things to do. I am planning to add a macro named EXIT_INTERPRETER_BEGIN/END, or something along those lines, to contain that stuff.
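The two-way seek described above can be sketched like this, assuming a doubly linked list in which inactive frames stay linked after use and are marked by a null ip. `FrameNode`, `IsActive` and `FindTopActiveFrame` are hypothetical names for illustration, not the code in frames.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical doubly linked interpreter frame; ip == nullptr means inactive.
struct FrameNode {
    FrameNode* parent = nullptr;       // towards the bottom of the substack
    FrameNode* next = nullptr;         // towards the top of the substack
    const std::uint8_t* ip = nullptr;  // non-null only while the frame is active
};

inline bool IsActive(const FrameNode* frame) {
    return frame != nullptr && frame->ip != nullptr;
}

// Starting from a possibly stale hint, find the topmost active frame.
// The state of the starting frame decides the direction of the search.
inline FrameNode* FindTopActiveFrame(FrameNode* hint) {
    FrameNode* frame = hint;
    if (IsActive(frame)) {
        // The hint is live but may be below the real top: walk upwards
        // while the next frame is still active.
        while (IsActive(frame->next)) {
            frame = frame->next;
        }
    } else {
        // The hint is above the real top: walk downwards until an active
        // frame is found (or the list ends, giving nullptr).
        while (frame != nullptr && !IsActive(frame)) {
            frame = frame->parent;
        }
    }
    return frame;
}
```

When the hint was stored exactly at the point of leaving the interpreter, neither loop iterates, which matches the claim that the non-SOS scenarios never pay for scanning the list.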

> It would cause a lot of seeking the first time we do a stack walk which is completely unnecessary.

This work is small and proportional to the total cost of the stack walk (assuming we typically walk a large part of the stack).

I am looking at it as "total number of instructions executed" in a given scenario. I think updating the pointer in the stackwalker would be better on average in this metric.

This is a micro-optimization. It is fine to keep this as is. Comment would be nice.

> We will need to do other stuff while leaving the interpreter, like catching exceptions from the native code and calling the DispatchManagedException, so this is going to be just one of the things to do.

Yep. I expect that the total cost of this other stuff will make us consider building the interpreter executor in C# eventually - but that is for future discussion.

I've added a comment

kg · 2025-04-01T18:41:12Z

src/coreclr/vm/eetwain.cpp

+    {
+        // We already have the caller's frame context
+        // We just switch the pointers
+        PT_CONTEXT temp      = pRD->pCurrentContext;


Isn't unwinding the process of walking up the stack? Why are we switching the pointers instead of just overwriting the current context with the caller context? I assume I'm missing something here

If we already have the caller context available, then we can just make it the current context; there is no need to call unwind again to get the same values. Switching the pointers avoids copying in this specific case.
There are cases when we want to know the caller context without actually unwinding, so we call EnsureCallerContextIsValid. That's where the caller context gets created, and then it can get reused here.
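A rough sketch of the pointer switch, under simplifying assumptions: `Context`, `RegDisplaySketch`, `FixedSizeUnwind` and `UnwindOneFrame` are hypothetical stand-ins, the unwinder is a toy that just adds a fixed frame size, and the real REGDISPLAY carries much more state.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical, tiny register context.
struct Context {
    std::uintptr_t ip = 0;
    std::uintptr_t sp = 0;
};

// Hypothetical register display holding two context buffers and two pointers.
struct RegDisplaySketch {
    RegDisplaySketch() = default;
    RegDisplaySketch(const RegDisplaySketch&) = delete;  // pointers refer to own storage
    RegDisplaySketch& operator=(const RegDisplaySketch&) = delete;

    Context storage[2];
    Context* pCurrentContext = &storage[0];
    Context* pCallerContext = &storage[1];
    bool isCallerContextValid = false;
};

using UnwindFn = void (*)(const Context& callee, Context* caller);

static int g_unwindCalls = 0;  // counts how often the toy unwinder runs

// Toy unwinder for illustration: every frame is 0x20 bytes.
inline void FixedSizeUnwind(const Context& callee, Context* caller) {
    ++g_unwindCalls;
    caller->sp = callee.sp + 0x20;
    caller->ip = callee.ip + 1;
}

// Peek at the caller without moving: fills the caller context once.
inline void EnsureCallerContextIsValid(RegDisplaySketch* rd, UnwindFn unwind) {
    if (!rd->isCallerContextValid) {
        unwind(*rd->pCurrentContext, rd->pCallerContext);
        rd->isCallerContextValid = true;
    }
}

// Move to the caller. If the caller context was already computed, swap the
// two pointers instead of copying the context or unwinding a second time.
inline void UnwindOneFrame(RegDisplaySketch* rd, UnwindFn unwind) {
    if (rd->isCallerContextValid) {
        std::swap(rd->pCurrentContext, rd->pCallerContext);
    } else {
        Context caller;
        unwind(*rd->pCurrentContext, &caller);
        *rd->pCurrentContext = caller;
    }
    // Whatever the other buffer holds now is no longer the caller of the
    // new current frame.
    rd->isCallerContextValid = false;
}
```

After the swap, the old current context sits in the caller buffer as stale data, which is why the validity flag is cleared rather than left set.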

kg

LGTM

janvorli · 2025-04-01T19:14:29Z

/ba-g the test failures in this PR are happening on all PRs.

janvorli added the area-VM-coreclr label Mar 25, 2025

janvorli added this to the 10.0.0 milestone Mar 25, 2025

janvorli requested review from kg and jkotas March 25, 2025 21:40

janvorli self-assigned this Mar 25, 2025

Copilot AI review requested due to automatic review settings March 25, 2025 21:40

Copilot AI reviewed Mar 25, 2025

kg reviewed Mar 25, 2025

src/coreclr/vm/exceptionhandling.cpp

jkotas reviewed Mar 25, 2025

src/coreclr/debug/daccess/dacdbiimplstackwalk.cpp (outdated)

jkotas reviewed Mar 25, 2025

kg mentioned this pull request Mar 26, 2025

Interpreter GC info stage 2: Generate empty GC info and add baseline implementation of EnumGcRefs #113948

Merged

BrzVlad reviewed Mar 27, 2025

src/coreclr/vm/prestub.cpp (outdated)

janvorli mentioned this pull request Mar 7, 2025

CoreCLR Interpreter - CoreCLR support #112742

Closed

4 tasks

janvorli force-pushed the stack-walk-interpreter-support-work branch from dd98e2b to 62f3514 Compare March 28, 2025 00:39

This was referenced Mar 28, 2025

System.Net.Quic tests timeout #107761

Closed

System.Net.Requests test timeout #113883

Closed

build-analysis bot mentioned this pull request Mar 29, 2025

/root/helix/work/correlation/scripts/<hash>/execute.sh: Permission denied dotnet/dnceng#3412

Open

3 tasks

jkotas reviewed Mar 30, 2025

src/coreclr/inc/eetwain.h

jkotas reviewed Mar 30, 2025

src/coreclr/vm/gcinfodecoder.cpp

jkotas reviewed Mar 30, 2025

src/coreclr/vm/gcinfodecoder.cpp (outdated)

jkotas reviewed Mar 30, 2025

janvorli added 4 commits March 31, 2025 14:47

Fix VirtualUnwindInterpreterCallFrame

b459586

In case we were generating the caller context, the SP and ContextFlags were missing.

Move to single InterpreterFrame and few fixes

27cbf9f

Fix Unix build break

940dc3c

janvorli added 3 commits March 31, 2025 14:47

Prevent missing override errors on Unix build

565fffa

Improve top frame lookup

bb4af84

Reflect PR feedback

a257f58

* Change the GET_CALLER_SP to assert and return NULL * Remove extra assert at the call to GET_CALLER_SP * Revert some changes removing "virtual" keyword that were not intentional

janvorli force-pushed the stack-walk-interpreter-support-work branch from 5d43a67 to a257f58 Compare March 31, 2025 12:47

janvorli added 3 commits March 31, 2025 14:51

Few missing virtual keywords reverts

6122bae

Revert the GetCallerSP changes

5ba8a0e

Fix MUSL build break

559885d

BrzVlad reviewed Mar 31, 2025

BrzVlad approved these changes Mar 31, 2025

This was referenced Mar 31, 2025

[QUIC & HTTP/3] Handshake Timeout on tests #104426

Closed

System.OperationCanceledException : The operation was canceled. dotnet/dnceng#5278

Closed

System.TimeoutException : The operation has timed out. dotnet/dnceng#5279

Closed

jkotas reviewed Mar 31, 2025

Add comment on top frame seeking

e4c90ab

kg reviewed Apr 1, 2025

kg approved these changes Apr 1, 2025

janvorli merged commit fa45aa5 into dotnet:main Apr 1, 2025
96 of 98 checks passed

janvorli deleted the stack-walk-interpreter-support-work branch April 1, 2025 19:14

github-actions bot locked and limited conversation to collaborators May 2, 2025
