Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.9rc2 windows/amd64
What operating system and processor architecture are you using (go env)?
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\Users\kjk\src\go
set GORACE=
set GOROOT=C:\Go
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
What did you do?
This is a continuation of #20975 so the same repro program (https://github.com/kjk/go20975) built in 64bit mode.
What did you expect to see?
No infinite recursion.
What did you see instead?
This time I used https://github.com/kjk/cv2pdb to convert dwarf to pdb so that I can get symbols in windbg.
I ran repro program under windbg.
The crash is:
# RetAddr : Args to Child : Call Site
00 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x10 [C:\Go\src\runtime\asm_amd64.s @ 377]
01 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152]
02 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382]
03 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152]
04 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382]
05 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152]
06 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382]
07 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152]
08 00000000`0043cc0b : 00000000`00451a56 00000000`0043cc0b 00000000`00451a56 00000000`0045104a : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382]
09 00000000`00451a56 : 00000000`0043cc0b 00000000`00451a56 00000000`0045104a 00000000`004519ee : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152]
0a 00000000`0043cc0b : 00000000`00451a56 00000000`0045104a 00000000`004519ee 00000000`004304f0 : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382]
0b 00000000`00451a56 : 00000000`0045104a 00000000`004519ee 00000000`004304f0 00000000`00b9fef0 : go20975!runtime.sigpanic+0x18b [C:\Go\src\runtime\signal_windows.go @ 152]
0c 00000000`0045104a : 00000000`004519ee 00000000`004304f0 00000000`00b9fef0 00000000`00000000 : go20975!runtime.morestack+0x26 [C:\Go\src\runtime\asm_amd64.s @ 382]
0d 00000000`004519ee : 00000000`004304f0 00000000`00b9fef0 00000000`00000000 00007ffa`59b6e618 : go20975!runtime.exitsyscallfast.func1+0xaa [C:\Go\src\runtime\proc.go @ 2717]
0e 00000000`004304f0 : 00000000`00b9fef0 00000000`00000000 00007ffa`59b6e618 00000000`00455804 : go20975!runtime.systemstack+0x7e [C:\Go\src\runtime\asm_amd64.s @ 347]
0f 00000000`00b9fef0 : 00000000`00000000 00007ffa`59b6e618 00000000`00455804 00000000`006307d8 : go20975!runtime.mstart [C:\Go\src\runtime\proc.go @ 1125]
10 00000000`00000000 : 00007ffa`59b6e618 00000000`00455804 00000000`006307d8 00000000`00b90e00 : 0xb9fef0
TEXT runtime·morestack(SB),NOSPLIT,$0-0
	// Cannot grow scheduler stack (m->g0).
	get_tls(CX)
	MOVQ	g(CX), BX
	MOVQ	g_m(BX), BX
	MOVQ	m_g0(BX), SI
	CMPQ	g(CX), SI
	JNE	3(PC)
	CALL	runtime·badmorestackg0(SB)
	INT	$3
INT $3 is executed, which triggers runtime.sigpanic. I assume sigpanic does a stack check and calls morestack, which executes INT $3 again. This loops indefinitely until the process eventually crashes.
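The loop guessed at above can be modeled in a few lines of Go. This is a toy simulation under stated assumptions, not runtime code: the function name simulate, the 64 KiB of remaining stack and the 512-byte handler frame are all made up for illustration. It only shows why a fault handler that shares the exhausted stack keeps re-faulting until the stack is gone, while a handler with a stack of its own (as a later comment notes for UNIX signal handlers) can report the first fault and stop.

```go
package main

import "fmt"

// simulate is a hypothetical model of the morestack -> INT $3 -> sigpanic
// cycle. remaining is the number of bytes physically left on the stack and
// handlerFrame is what one trip through the handler consumes (it must be
// positive). When separateStack is true the handler runs on its own stack.
func simulate(remaining, handlerFrame int, separateStack bool) (faults int, overflowed bool) {
	if separateStack {
		// The handler has room of its own, so the first fault is
		// reported and nothing overflows.
		return 1, false
	}
	for remaining >= handlerFrame {
		// The handler's own stack check fails again, so every fault
		// pushes one more handler frame onto the same stack.
		remaining -= handlerFrame
		faults++
	}
	// No room for another handler frame: an unhandled stack overflow.
	return faults, true
}

func main() {
	fmt.Println(simulate(64*1024, 512, false)) // prints: 128 true
	fmt.Println(simulate(64*1024, 512, true))  // prints: 1 false
}
```

The numbers do not matter; the point is that without a separate stack or an early exit, the only way the cycle ends is by running out of memory to push frames into.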
kjk commented on Aug 10, 2017
So I'm stepping through the assembly and there's more fishy stuff.
After int 3 we end up in:
However, at that point rax is 0x17, so trying to dereference [rax-75h] throws an exception:
That doesn't make sense to me unless this is a trick to just trigger an exception.
Here's what gets executed, according to windbg, when single-stepping from int 3 to calling morestack again:
I don't get how executing:
ends up going to go20975!runtime.morestack_noctxt == 00000000`00451ad0.
mvdan commented on Aug 10, 2017
Just to clarify, is this when building the program, or when running it?
Does this happen with 1.8?
CC @aclements
mvdan commented on Aug 10, 2017
Also, if this was an infinite recursion, wouldn't you end up with a panic or crash of some sort? I don't know what windbg is, so perhaps there's something I'm missing.
kjk commented on Aug 10, 2017
Eventually the process will go away due to a stack overflow exception. In this scenario the runtime is incapable of handling it and generating a proper panic.
alexbrainman commented on Aug 11, 2017
I would not expect Go to generate a proper panic after executing INT $3. I will let Austin decide if something needs to be done here.
Alex
aclements commented on Sep 5, 2017
I'd like to understand how we wound up in morestack without any remaining system stack space in the first place. Once we hit the INT $3, it would be nice to fail more gracefully, but things are toast anyway.
Is there any way MSHTML could be calling back into Go code while deep in the stack?
If not, and I'm grasping at straws here, but my guess is that the C "syscall" code (which runs on the system stack) is running out of stack space, which invokes a Windows exception handler registered by the runtime, which also attempts to run on the system stack and fails when it sees there's no stack left. @alexbrainman, I know very little about how Windows exception handlers work; does this seem like a plausible explanation?
(Notably, on UNIX platforms, the signal handler runs on yet another stack that's only for signal handling, so even if we run out of space on the system stack, we have a little more backup room in which to fail gracefully.)
aclements commented on Sep 5, 2017
Actually, this is sort of interesting, though I'm not sure what to make of it: exitsyscallfast.func1 is specifically the closure that does throw("exitsyscall: syscall frame is no longer valid"). This indicates that we tried to return from the system call (though why that would be, I'm not sure), but the stack got unwound or the SP just changed completely. Then, we tried to switch to the system stack to report this, but it was full, leading to a cascade of other problems.
@kjk, can you put a breakpoint in exitsyscall at the first call to systemstack (inside the if getcallersp(unsafe.Pointer(&dummy)) > g.syscallsp) and see what the call stack there is?
kjk commented on Sep 5, 2017
@aclements Please also read comments #20975 (comment) and below as this is the same issue and there is more detail there.
To summarize my guesses at this point:
It's not caused by running out of stack space.
morestack is called unconditionally (i.e. regardless of how much stack is left) by the closure passed to systemstack in exitsyscallfast.func1.
When morestack is called there's plenty of stack but it detects that it's being called on the scheduler stack (g.m.g0 == g), which shouldn't happen because systemstack is supposed to ensure that it's, well, the system stack. There seems to be a missed case in that logic.
When morestack detects this invariant being violated, it does int 3 to trigger the debugger and make debugging easy.
It seems to be 64-bit only so I assume it's some of the arch-specific runtime assembly routines.
mshtml per se doesn't call Go but there are plenty of C->Go->C transitions because of how Windows message processing works.
Each window has a callback (called wndproc) responsible for handling message for that window. In Windows every control (a button, listview, browser view etc.) is a window.
To add custom handling of messages we need to provide our own wndproc callback, which must be called via C->Go trampoline. When that callback is not interested in the message, we need to call the original wndproc for that message, which is Go->C transition.
So every Windows GUI program has a high rate of C->Go and Go->C transitions, especially those using the https://github.com/lxn/walk/ library, as it hooks wndproc for all windows it creates.
This also makes debugging with breakpoints impossible. I've spent several hours setting breakpoints at various points and stepping through the code but the same code works correctly the first 1000 times and then fails.
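The wndproc chaining described above can be sketched in portable Go. This is only a model: the real original procedure lives in C and the hook is reached through a trampoline, whereas here both are plain Go funcs, and the names WndProc, subclass and countTransitions are hypothetical. The two message IDs use the documented values of WM_PAINT and WM_CLOSE.

```go
package main

import "fmt"

// WndProc mirrors the shape of a Windows window procedure.
type WndProc func(hwnd, msg, wParam, lParam uintptr) uintptr

const (
	wmPaint uintptr = 0x000F // WM_PAINT
	wmClose uintptr = 0x0010 // WM_CLOSE
)

// subclass returns a procedure that handles the messages listed in handled
// and forwards everything else to orig. transitions counts the language
// boundary crossings a real program would make.
func subclass(orig WndProc, handled map[uintptr]WndProc, transitions *int) WndProc {
	return func(hwnd, msg, wParam, lParam uintptr) uintptr {
		*transitions++ // C -> Go: the OS delivered a message to our hook
		if h, ok := handled[msg]; ok {
			return h(hwnd, msg, wParam, lParam)
		}
		*transitions++ // Go -> C: fall through to the original wndproc
		return orig(hwnd, msg, wParam, lParam)
	}
}

// countTransitions delivers the given number of messages to one hooked
// window and reports how many boundary crossings that took.
func countTransitions(paints, closes int) int {
	transitions := 0
	orig := WndProc(func(hwnd, msg, wParam, lParam uintptr) uintptr { return 0 })
	handled := map[uintptr]WndProc{
		wmClose: func(hwnd, msg, wParam, lParam uintptr) uintptr { return 1 },
	}
	proc := subclass(orig, handled, &transitions)
	for i := 0; i < paints; i++ {
		proc(1, wmPaint, 0, 0) // not handled: forwarded to orig
	}
	for i := 0; i < closes; i++ {
		proc(1, wmClose, 0, 0) // handled in "Go"
	}
	return transitions
}

func main() {
	fmt.Println(countTransitions(1000, 1)) // prints: 2001
}
```

Even in this toy, a message the hook ignores costs two crossings, which is why a hooked GUI sees so many of them.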
To summarize my beliefs:
systemstack is failing to switch to the system stack before calling its closure, and remains on the scheduler stack.
The most promising approach would be to instrument systemstack to add the same check that morestack does, but when exiting systemstack, to catch the bad condition (remaining on the scheduler stack) earlier.
aclements commented on Sep 5, 2017
@kjk, systemstack is extremely well-trodden code. Obviously it's not impossible that it contains a bug, but that's way down on my list of candidates.
Why do you say that? It never makes sense to call morestack unconditionally, and, looking at the disassembly of exitsyscallfast.func1, it clearly does check the stack bound before calling morestack, as it's supposed to.
This isn't quite right. There is no separate "scheduler stack", there's just the user stack and the system stack (and the signal stack on UNIX). If g.m.g0 == g, then you're on the system stack. So, systemstack is supposed to put you on the system stack, at which point g.m.g0 == g, and any call to morestack should panic.
What makes you say there's plenty of stack when it calls morestack? I didn't see evidence for that here or on the other issue (I may have just missed it; there are a lot of posts).
Can you point me to where your code is doing this? Normally this would go through the cgo callback paths, but since your application isn't using cgo, I'm curious how this is being done.
Given the C->Go callbacks, this is all precisely the behavior I would expect if C code were using up the system stack and then calling back into Go code.
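For readers following along, the invariant described above can be written down as a small Go model. The types g and m here are drastically simplified stand-ins, and systemstack and morestack are hypothetical reimplementations that only track which goroutine is current. They are not the runtime's code, which switches real stacks in assembly.

```go
package main

import (
	"errors"
	"fmt"
)

// g and m are simplified stand-ins for the runtime's goroutine and
// thread descriptors.
type g struct {
	name string
	m    *m
}

type m struct {
	g0   *g // goroutine that owns the system stack
	curg *g // goroutine currently running on this thread
}

var errMorestackOnG0 = errors.New("fatal: morestack on g0")

// onSystemStack reports whether gp is running on its thread's system stack.
func onSystemStack(gp *g) bool { return gp.m.g0 == gp }

// morestack models the check at the top of runtime·morestack: a user stack
// may grow, the system stack may not.
func morestack(gp *g) error {
	if onSystemStack(gp) {
		return errMorestackOnG0
	}
	return nil // a user stack would be replaced by a larger one here
}

// systemstack runs fn as g0 and then switches back to the previous g.
func systemstack(mp *m, fn func(cur *g)) {
	prev := mp.curg
	mp.curg = mp.g0
	fn(mp.curg)
	mp.curg = prev
}

// newThread builds one thread with a g0 and a running user goroutine.
func newThread() (*m, *g) {
	mp := &m{}
	mp.g0 = &g{name: "g0", m: mp}
	user := &g{name: "user", m: mp}
	mp.curg = user
	return mp, user
}

func main() {
	mp, user := newThread()
	fmt.Println(user.name, morestack(user)) // prints: user <nil>
	systemstack(mp, func(cur *g) {
		fmt.Println(cur.name, morestack(cur)) // prints: g0 fatal: morestack on g0
	})
}
```

In this model a morestack failure inside systemstack is the expected outcome whenever the system stack runs low, not evidence that the switch itself went wrong.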
(From my earlier post:)
Oops, I'd missed the fast in there, so I was looking at the wrong closure. Unfortunately, I would expect exitsyscallfast.func1 to be called quite frequently in normal operation, so setting a breakpoint there isn't useful. (But it does mean the SP probably isn't getting totally trashed like I thought.)
kjk commented on Sep 5, 2017
Like I said, those are guesses, you're more likely to be right than me.
I'm just parroting back terminology used by the code e.g.
go/src/runtime/asm_arm64.s
Line 271 in 3216e0c
I've tried the repro with ridiculously large (16 MB) stack and got the same thing.
In the debugger, I printed the callstack and it was relatively short from main().
Either way, this particular issue is due to morestack detecting an internal inconsistency (and not being able to handle it via a somewhat controlled panic, which eventually triggers a Windows exception that silently kills the process).
It's also consistent with being confused about which stack the code is on.
If the code is confused about which stack it is on, then we might be on a thread with plenty of stack but the "needs to grow stack" check is done on the wrong stack, wrongly detects a need to expand the stack, and calls morestack, which detects it's the wrong stack and does int 3.
On Windows, syscall.Syscall does the Go->C call and syscall.NewCallback creates the C->Go callback.
This is all done in the lxn/walk library:
Windows GUI code is roughly this (https://github.com/lxn/walk/blob/2d327b4a1aba7cda2a365bc566fd60ea6bd4c8bf/form.go#L365):
It's unavoidable to get Go -> C -> Go -> C in Windows GUI programs. Using cgo is not necessary for that.
alexbrainman commented on Sep 6, 2017
The Windows exception handler calls runtime.sigtramp, which will run on the scheduler stack. Also, _StackSystem is used to make sure we always have enough room to run the exception handler.
If you are interested in seeing a simple Windows GUI app, you can download the d8b239ff60a62c3f50f7eb5994221b50ba055cf2 commit (initial commit) of https://github.com/alexbrainman/gowingui
Alex
Title changed from "Runtime infinite recursion on windows triggered by morestack" to "runtime: infinite recursion on windows triggered by morestack"
kjk commented on Jul 4, 2018
BTW: the same problem happens on 386.
kjk commented on Jul 4, 2018
Another observation: memoryBasicInformation.baseAddress and memoryBasicInformation.allocationBase are shifted by 0x1000 from TEB StackBase and StackLimit (i.e. StackLimit is 0xa1000 and allocationBase is 0xa0000).
g0.stack.lo is allocationBase + 0x2000, which explains the 0x1000 difference (0x2000 - 0x1000) between g0.stack.lo and TEB.StackBase.
TEB is https://en.wikipedia.org/wiki/Win32_Thread_Information_Block
But not sure if that's relevant. I patched minit() with:
to make them match and that didn't change anything.
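The address arithmetic above is easy to check with a few lines of Go. The sketch below is not runtime code: it uses only the two example values quoted in this comment (allocationBase 0xa0000 and StackLimit 0xa1000) and the 0x2000 offset, and the helper name g0StackLo is made up.

```go
package main

import "fmt"

// g0StackLo mirrors the expression described in the thread: the recorded
// lower bound sits a fixed offset above the allocation base.
func g0StackLo(allocationBase, offset uintptr) uintptr {
	return allocationBase + offset
}

func main() {
	const (
		allocationBase uintptr = 0xa0000 // from VirtualQuery, as quoted above
		stackLimit     uintptr = 0xa1000 // TEB StackLimit, as quoted above
		offset         uintptr = 0x2000  // 8*1024
	)
	lo := g0StackLo(allocationBase, offset)
	fmt.Printf("StackLimit - allocationBase = %#x\n", stackLimit-allocationBase)
	fmt.Printf("g0.stack.lo                 = %#x\n", lo)
	fmt.Printf("g0.stack.lo - StackLimit    = %#x\n", lo-stackLimit)
}
```

With these inputs the recorded lower bound comes out 0x1000 above the TEB value, matching the 0x2000 - 0x1000 reasoning.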
kjk commented on Jul 4, 2018
Changing the slack value from 8*1024 to 16*1024 (base := mbi.allocationBase + 8*1024) fixes the stack overflow. runtime.morestack gets called, prints the "fatal: morestack on g0" message, and does int 3, which invokes the exception handler.
However, things then get recursive, i.e. code in the exception handler will call runtime.morestack etc.
I've added //go:nosplit from @aclements PR, then I've added some more to functions that are called within the exception handler (e.g. traceback(), gettraceback(), findfunc()) and then I've hit the limit on those:
kjk commented on Jul 4, 2018
And here's a fix:
Instead of trying to remove implicit calls to morestack by adding //go:nosplit, I just made it believe that everything is ok by using the slack we've added in minit().
With this change I get the proper clean exit with callstacks printed:
aclements commented on Jul 5, 2018
Thanks for the great debugging @kjk! You've definitely found the root of the problem: the initial INT3 traps fine and we detect that there's a problem, but since the stack bounds aren't quite right we wind up walking off the edge of the actual stack and the subsequent failures are a different exception, which we don't handle so carefully.
It seems like we should perhaps use the TIB instead of VirtualQuery to get the stack bounds (I'm not sure why I didn't come across that when originally figuring out how to get the stack bounds), and do something to make room for handling the exception (like the slack you added in exceptionhandler).
To answer some of your other questions, which you may have already found the answers to:
Unlike goroutine stacks, this stack is allocated by Windows itself when we create the thread.
The PAGE_GUARD is definitely a problem. Apparently the VirtualQuery call we use to find the bounds of the stack considers that to be part of the mapping, even though we can't actually use that memory. That's what causes the runtime to set up the wrong stack bounds.
There's nothing in the Go runtime that commits more stack or in any way extends a system stack. The OS commits more stack memory as we touch it, but that's transparent to Go.
That's not a bad idea, though it needs to be done a little more carefully. :)
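To make the PAGE_GUARD point concrete, here is a rough Go model of why the offset above allocationBase matters. Everything in it is an assumption for illustration: the region type, the method names, and in particular the 8 KiB guard size, which is not stated in this thread. It only shows how the recorded lower bound relates to the part of the mapping that cannot be used.

```go
package main

import "fmt"

// region is a hypothetical description of a thread stack mapping.
type region struct {
	allocationBase uintptr // lowest address VirtualQuery reports
	guardSize      uintptr // bytes of PAGE_GUARD at the bottom (assumed)
}

// lo returns the lower stack bound that would be recorded for a given
// offset above allocationBase.
func (r region) lo(offset uintptr) uintptr { return r.allocationBase + offset }

// headroom returns the usable bytes between the recorded bound and the top
// of the guard region. Zero means that any overrun of the recorded bound
// lands in the guard region right away.
func (r region) headroom(offset uintptr) uintptr {
	guardTop := r.allocationBase + r.guardSize
	if lo := r.lo(offset); lo > guardTop {
		return lo - guardTop
	}
	return 0
}

func main() {
	r := region{allocationBase: 0xa0000, guardSize: 8 * 1024}
	fmt.Println(r.headroom(8 * 1024))  // prints: 0
	fmt.Println(r.headroom(16 * 1024)) // prints: 8192
}
```

Under that assumed guard size, an 8 KiB offset leaves no room to report a failure once the bound is crossed, while 16 KiB leaves 8 KiB, which would be consistent with the larger value making the overflow go away.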
I've been trying to figure out what stack the vectored exception handler runs on when it's handling a stack overflow exception without much luck. The closest I've come is https://stackoverflow.com/questions/1897301/vectored-exception-handling-during-stackoverflowexception, but that could mean the OS reserves some small dedicated stack for this purpose, or that it lets the stack grow into the PAGE_GUARD region for this purpose. Either way, it's probably better if we just completely avoid overrunning the stack.
aclements commented on Jul 5, 2018
Sigh. Apparently the StackLimit field in the TIB gives the limit of the committed stack, not the reserved stack, so that's not useful. There's a later field with the "Address of memory allocated for stack" but that returns the same base address as VirtualQuery.
Based on https://docs.microsoft.com/en-us/windows/desktop/Memory/creating-guard-pages, this is a bit different from the guard pages I'm used to. Apparently Windows will let you use that memory, but only after the process has handled a STATUS_GUARD_PAGE_VIOLATION exception.
gopherbot commented on Jul 6, 2018
Change https://golang.org/cl/122515 mentions this issue:
runtime: fix abort handling on Windows
gopherbot commented on Jul 6, 2018
Change https://golang.org/cl/122516 mentions this issue:
runtime: account for guard zone in Windows stack size