runtime: potential gc deadlock #68373
Comments
cc @mknyszek: We're still continuing our investigation on this. Feel free to ignore this since it's for go1.20.x. But I figured we should raise this on the off-chance that you know of any GC locking issues that have been fixed in go1.21+ or can give some advice on how we could rule out cgo-related heap corruption. 🙇

@gabyhelp good bot! I looked through all of those, and they seem unrelated. That being said, I noticed that #61428 talks about contention in
Our investigation is increasingly pointing to a kernel bug as the root cause here. Other processes seem to be failing on the same host as well. I'm leaving this open until we have fully confirmed it, but at this point I don't think this needs to be investigated by the Go team.
Our investigation has confirmed that this was caused by the aforementioned kernel bug. Closing this. Sorry for the noise.
Note: I'm posting this issue on behalf of a colleague who did the analysis below.
Go version: 1.20.10.
Go env: N/A, but we know the following about the linux/amd64 VM this happens on:

We're seeing the same issue with a 96-core machine.
What did you do?
We are investigating an issue where the Go garbage collector appears to get deadlocked. The Go program either starts to use more and more memory, or just becomes inoperable.
We collected debugging information and core dumps from the affected processes.

The process is using cgo extensively, and we haven't ruled out heap corruption or possible hardware issues yet, but we are sharing this early in case this seems familiar to anyone.
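As a hedged aside on the cgo angle: on go1.20, the runtime's cgo pointer-passing checks (GODEBUG=cgocheck=1 by default, cgocheck=2 for the stricter mode) catch one common source of cgo-induced heap corruption, namely handing C a Go pointer to memory that itself contains Go pointers. A minimal sketch of the kind of violation they report:

```go
// Sketch only: build with cgo enabled, e.g. GODEBUG=cgocheck=2 go run .
// The call below breaks the cgo pointer-passing rules, and the runtime
// reports it ("cgo argument has Go pointer to Go pointer") instead of
// letting it turn into silent heap corruption later.
package main

/*
static void take(void *p) { (void)p; }
*/
import "C"

import "unsafe"

type node struct {
	next *node // a Go pointer stored inside the value handed to C
}

func main() {
	n := &node{next: &node{}}
	C.take(unsafe.Pointer(n)) // flagged by the cgocheck machinery at the call site
}
```

These checks only cover the pointer-passing rules; they don't catch C code scribbling over Go memory through a retained pointer, so they narrow the search rather than ruling cgo out entirely.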
What did you see happen?
Investigating the reports from a customer, we found that the LastGC value (as reported by memstats published via expvars) can be hours behind the current time, and NextGC ≪ HeapAlloc, for example, from a snapshot taken on Wed Jun 19 20:34:24 UTC 2024:
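As a minimal sketch (not the customer snapshot), this is how the relevant fields can be read directly with runtime.ReadMemStats; the memstats published via expvar expose the same data:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	lastGC := time.Unix(0, int64(m.LastGC)) // LastGC is nanoseconds since the Unix epoch
	fmt.Printf("LastGC:    %v (%v ago)\n", lastGC, time.Since(lastGC).Round(time.Second))
	fmt.Printf("NextGC:    %d bytes\n", m.NextGC)
	fmt.Printf("HeapAlloc: %d bytes\n", m.HeapAlloc)

	// In a healthy program HeapAlloc stays near NextGC and LastGC is recent;
	// the symptom described above is an old LastGC with HeapAlloc far above NextGC.
	if time.Since(lastGC) > time.Hour && m.HeapAlloc > 2*m.NextGC {
		fmt.Println("GC appears to be stalled")
	}
}
```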
We obtained some core dumps from the affected instances of the program and found that the GC in them was active, but blocked in different places, on goroutines executing the same tight loop in the runtime spanSet#pop:

https://github.com/golang/go/blob/go1.20.10/src/runtime/mspanset.go#L189-L195
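For context, a rough sketch of the shape of that loop (my own illustration, not the runtime code): pop claims a slot index with an atomic operation and then busy-waits until the corresponding pusher publishes a non-nil pointer into that slot. If the pointer never appears, the goroutine spins forever, which matches what the core dumps show.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// set is a toy analogue of the runtime's spanSet: producers claim a tail
// slot and publish a pointer; consumers claim a head slot and spin until
// the pointer becomes visible.
type set struct {
	head  atomic.Uint32
	tail  atomic.Uint32
	slots [64]atomic.Pointer[int]
}

// push reserves a slot, then publishes the element into it.
func (s *set) push(v *int) {
	i := s.tail.Add(1) - 1
	s.slots[i%uint32(len(s.slots))].Store(v)
}

// pop reserves the next filled slot, then busy-waits until the producer's
// Store is visible. If the producer never completes, or the claimed slot is
// never filled because of corrupted indices, this loop never exits: the
// same shape as the stuck goroutines in the core dumps.
func (s *set) pop() *int {
	i := s.head.Add(1) - 1
	slot := &s.slots[i%uint32(len(s.slots))]
	for {
		if v := slot.Load(); v != nil {
			slot.Store(nil)
			return v
		}
		// Spin: the slot has been reserved but the pointer has not been
		// published yet.
	}
}

func main() {
	var s set
	x := 42
	s.push(&x)
	fmt.Println(*s.pop()) // prints 42
}
```

A popper can only hang in such a loop if the matching publish never happens or the indices are corrupted, which is consistent with the heap-corruption and hardware suspicions mentioned above.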
One example, from a case where the GC STW phase was waiting on spanSet#pop:

All other goroutines appear to be sleeping, as would be expected at this stage in stopTheWorldWithSema.

In another case, we found a gcBgMarkWorker stuck on the same issue:

This case appeared as leaking memory: parts of the program continued to run, allocating memory, but the GC was not running, waiting on goroutines stuck in the same place:
We were able to confirm that the GC was indeed deadlocked, and that this wasn't just an artifact of when the snapshot happened to be taken.
Monotonic time when the GC was started:
A timestamp from a goroutine that triggered the spanSet#pop call provides a reference time:

And from another part of the program that didn't get blocked by the GC, we can find the current time:
In this case, the GC appears to have been stuck for over 3 days.
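For reference, the arithmetic behind that estimate is just the difference of two monotonic timestamps; a tiny sketch with placeholder values (not the real numbers from the dump):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Placeholder nanosecond values standing in for the timestamps
	// recovered from the core dump.
	const (
		gcStartNanos = 1_000_000_000_000   // monotonic time when the GC cycle started
		currentNanos = 270_000_000_000_000 // monotonic "now" from an unblocked goroutine
	)
	stuck := time.Duration(currentNanos - gcStartNanos)
	fmt.Printf("GC stuck for %v (~%.1f days)\n", stuck, stuck.Hours()/24)
}
```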
What did you expect to see?
spanSet#pop executes quickly and doesn't block the GC.