runtime: fpTracebackPartialExpand SIGSEGV under high panic load #73664
Comments
CC @golang/runtime |
This is happening while collecting a stack trace for the runtime block profiler (https://pkg.go.dev/runtime#SetBlockProfileRate). That is going to be inherently nondeterministic, both because of the configured block profile rate and because a sample is only collected if the lock is actually contended. As a workaround, setting the block profile rate to 0 should prevent these crashes by avoiding the
One observation: The fault address looks suspiciously similar to the crash-reported frame pointers:
The A couple of questions:
I could believe that there is some bug with framepointer handling for deferred functions or injected panics (the nil dereference will cause a SIGSEGV where the signal handler injects a call to the panic handler). |
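The workaround mentioned above (a block profile rate of 0) can be sketched as follows. This is a minimal illustration rather than code from the affected service: the helper names `contendOnce` and `blockRecordsWithProfilingOff` are made up for the example, and it assumes nothing else in the process has enabled block profiling.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contendOnce (hypothetical helper) makes one goroutine block on a mutex
// held by another, which is the kind of event the block profiler samples.
func contendOnce() {
	var mu sync.Mutex
	mu.Lock()
	done := make(chan struct{})
	go func() {
		mu.Lock() // blocks until the holder unlocks
		mu.Unlock()
		close(done)
	}()
	time.Sleep(10 * time.Millisecond) // give the goroutine time to block
	mu.Unlock()
	<-done
}

// blockRecordsWithProfilingOff (hypothetical helper) turns block profiling
// off, forces some contention, and reports how many block profile records
// exist afterwards.
func blockRecordsWithProfilingOff() int {
	runtime.SetBlockProfileRate(0) // a rate of 0 disables block profiling
	contendOnce()
	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("block profile records:", blockRecordsWithProfilingOff())
}
```

With the rate at 0 the runtime does not sample blocking events, so no stack trace is collected on the contended lock path. The trade-off is that block profiles are no longer available for the process.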
I think I have a reproducer. At least, it's one way to crash frame pointer unwinding on a nil-pointer dereference. Playground link: https://go.dev/play/p/Nz9GXUrHfAr
Source:
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.SetBlockProfileRate(1)
	ch := make(chan struct{})
	time.AfterFunc(time.Second, func() { close(ch) })
	defer func() {
		if recover() != nil {
			<-ch
		}
		pprof.Lookup("block").WriteTo(os.Stdout, 0)
	}()
	var p *ints
	deref(p)
}

type ints [32]int

//go:noinline
func deref(p *ints) ints {
	return *p
}
Here's the output on the playground:
Interestingly, it doesn't outright crash on my M1 MacBook, but the resulting profile is clearly not right:
I don't see |
Wow, the repro seems very promising. I really appreciate the proactive support here. Posting two new stack traces we were able to gather today, showing two different call paths from our DD library that ultimately lead to the same
Interestingly, the second happened after we had hit a
They both have wildly different
|
Indeed, these look like totally arbitrary values. The first one looks like it's probably ASCII. |
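One quick way to check whether a suspicious pointer value is really text is to print its bytes as characters. The sketch below does that for a 64-bit value, assuming a little-endian machine (true of both amd64 and arm64). The helper name `asciiView` and the sample value are hypothetical and are not taken from the crash reports in this thread.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// asciiView (hypothetical helper) interprets v as eight little-endian bytes
// and renders each printable ASCII byte as itself and any other byte as '.'.
func asciiView(v uint64) string {
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], v)
	for i, c := range b {
		if c < 0x20 || c > 0x7e {
			b[i] = '.'
		}
	}
	return string(b[:])
}

func main() {
	// Made-up value for illustration only; it is not a real fault address.
	const suspect = 0x6665656264616564
	fmt.Printf("%#x -> %q\n", uint64(suspect), asciiView(suspect))
	// prints: 0x6665656264616564 -> "deadbeef"
}
```

If all or most of the eight bytes come out printable, the "frame pointer" was most likely overwritten by string data rather than by another pointer.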
Thanks for sharing some more examples! Focusing on the first one, I was interested to see where exactly the first panic is happening. I'm especially interested in what the state of the frame would be during the panicking access. The panic part is here:
I compiled a small program using that
From the crash traceback, we're at instruction offset
If I'm reading it right, we're faulting while dereferencing the stack pointer? I'm wondering how that's even possible... and it seems like the stack is okay-ish when we panic, since more code runs and uses (presumably) the same stack? This is quite strange.
@john-markham in the other two crashes you've shared, the first panic is happening in code you've marked as redacted. Would you be able to share just a tiny bit of the disassembly, if you have the exact binaries that were running when you saw the crashes? Basically, if the traceback shows something like
I'm starting to wonder if my "reproducer" is actually a separate issue, by the way. |
@nsrip-dd Here's much of the disassembly (for arm) of the
The exact line that panics is
At
The problematic dereference happens here
I asked our security team about sending over the full stack trace via email so you have a clearer picture. |
Go version
go version go1.23.8 linux/arm64 (also happens on go version go1.23.8 linux/amd64)
Output of go env in your module/workspace:
Note: this happened during an incident in prod, where I have little ability to run go env. I've included the output from my local environment, as we configure our Go env vars similarly between local and our deployed environment. Sorry, this is the best I can do with what I have currently.
What did you do?
Going to explain as if I'm walking up the stack trace included below:
In our GraphQL service, which is gqlgen based, one of our resolvers started panic'ing due to a nil pointer error:
A deferred function we use for span collection ran immediately after runtime.panicmem():
Our Datadog library was attempting to begin the process of "finish"ing the span associated with the resolver.
It attempted to acquire a lock belonging to the span, which somehow invoked a frame walk, which ended up triggering a SIGSEGV.
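The sequence described above (nil dereference, runtime panic, deferred span finish, contended lock) can be sketched roughly as below. This is an assumption-laden illustration, not the actual gqlgen or dd-trace-go code: `span`, `user`, `resolve`, and `run` are hypothetical names, and block profiling is deliberately left off, so the contended `Lock` here does not collect a stack trace. In the reported crashes the block profiler was enabled, and the stack walk happened on that contended lock path.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// span is a hypothetical stand-in for a tracer span guarded by a mutex.
type span struct {
	mu       sync.Mutex
	finished bool
}

// finish takes the span's lock, like the deferred span collection did.
func (s *span) finish() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.finished = true
}

// user is a hypothetical type whose nil pointer gets dereferenced.
type user struct{ name string }

// resolve mimics a resolver that hits a nil pointer. Deferred calls run in
// reverse order, so s.finish runs first, while the panic is still in
// flight, and only then does the outer deferred function recover.
func resolve(s *span, u *user) (name string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	defer s.finish()
	return u.name, nil // panics when u is nil
}

// run holds the span lock briefly in another goroutine so that the Lock
// inside finish is actually contended.
func run() (finished bool, err error) {
	s := &span{}
	s.mu.Lock()
	go func() {
		time.Sleep(20 * time.Millisecond)
		s.mu.Unlock()
	}()
	_, err = resolve(s, nil)
	return s.finished, err
}

func main() {
	finished, err := run()
	fmt.Println("span finished:", finished)
	fmt.Println("error:", err)
}
```

The point of the sketch is the ordering: the deferred function that takes the lock runs on a stack where a signal-injected panic has just occurred, which is the state the frame pointer unwinder has to walk when block profiling samples the contended lock.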
For convenience, here are the links corresponding to each call made in this process, starting from the DD library:
https://github.com/DataDog/dd-trace-go/blob/v1.999.0-rc.27/ddtrace/internal/v2.go#L157
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/span.go#L664
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/span.go#L730
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/spancontext.go#L301
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/spancontext.go#L499
https://cs.opensource.google/go/go/+/refs/tags/go1.24.3:src/internal/sync/mutex.go;l=149 (hm, this didn't exist in Go 1.23.8? Strange. Anyway, the method called is runtime_SemacquireMutex)
https://github.com/golang/go/blob/go1.23.8/src/runtime/sema.go#L95
https://github.com/golang/go/blob/go1.23.8/src/runtime/sema.go#L194
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L513
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L563
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L592
https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=909?q=signal_unix.go:909&sq=&ss=go%2Fgo
Interestingly, it does actually seem like this happened non-deterministically. Some requests seemed to be able to make it to our panic recovery mechanisms. Others, however, SIGSEGV'd and crashed our containers.
We did see #69629, which seems on the surface to be highly related if not the same exact issue. But unfortunately the go-metro FP clobbering issue would not apply here, as go-metro was not invoked at runtime by any of our code. We don't see any other potential misbehaving assembly that could clobber the FP.
We are able to reproduce this in our dev environment, but sadly not locally. Let me know what further information would be helpful to provide.
Tagging @nsrip-dd as he seems to have extensive expertise in issues similar or related to this (e.g. #61766)
What did you see happen?
What did you expect to see?
No SIGSEGV.