runtime: potential gc deadlock #68373

Closed
felixge opened this issue Jul 10, 2024 · 4 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@felixge
Contributor

felixge commented Jul 10, 2024

Note: I'm posting this issue on behalf of a colleague who did the analysis below.

Go version: 1.20.10.

Go env: N/A

But we know the following about the linux/amd64 VM this happens on:

OS: SLES12 (SUSE Linux Enterprise Server 12) SP3
Linux Kernel: 4.4 (custom built)
Core count: 80
Memory Size: 520 GB

We're seeing the same issue on a 96-core machine as well.

What did you do?

We are investigating an issue where the Go garbage collector appears to get deadlocked. The affected Go program either starts to use more and more memory, or just becomes inoperable.

We collected debugging information and core dumps from the affected processes.

The process is using CGO extensively, and we haven't ruled out heap corruption or possible hardware issues yet, but we are sharing this early in case this seems familiar to anyone.

What did you see happen?

Investigating the reports from a customer, we found that the LastGC value (as reported by memstats published via expvar) can be hours behind the current time, and that NextGC can be far below HeapAlloc. For example, from a snapshot taken on Wed Jun 19 20:34:24 UTC 2024:

HeapAlloc: 2.9148435376e+10
NextGC: 7.6087616e+07
LastGC: 1.7187810082630615e+18  # Wed Jun 19 07:10:08 UTC 2024
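
For reference, MemStats.LastGC is reported as nanoseconds since the Unix epoch, so the snapshot value decodes to the timestamp noted above. A minimal sketch of the conversion (the literal is the expvar float written out as integer nanoseconds):

package main

import (
	"fmt"
	"time"
)

func main() {
	// LastGC from the expvar snapshot, nanoseconds since the Unix epoch.
	lastGC := int64(1718781008263061500)
	fmt.Println(time.Unix(0, lastGC).UTC()) // 2024-06-19 07:10:08.2630615 +0000 UTC
}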

We obtained some core dumps from the affected instances of the program and found that the GC in them was active but blocked, in different places, on goroutines executing the same tight loop in the runtime's spanSet.pop:

https://github.com/golang/go/blob/go1.20.10/src/runtime/mspanset.go#L189-L195
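
The code at those lines spins on an atomically published block pointer that a concurrent pusher is expected to set almost immediately, so a goroutine parked there indefinitely is highly unusual. A self-contained sketch of the pattern (using sync/atomic rather than the runtime's internal types):

package main

import (
	"fmt"
	"sync/atomic"
)

// block stands in for the runtime's spanSetBlock; the real type holds spans.
type block struct{ spans [512]uintptr }

func main() {
	var blockp atomic.Pointer[block]

	// The pusher publishes the block after bumping the spine length, ...
	go blockp.Store(new(block))

	// ... so pop expects to see a non-nil pointer almost immediately and
	// simply spins until it does. This mirrors the shape of the loop at
	// mspanset.go:189-195, where the stuck goroutines were found.
	b := blockp.Load()
	for b == nil {
		b = blockp.Load()
	}
	fmt.Println("popped block:", b != nil)
}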

One example, from a case where the GC's stop-the-world (STW) phase was waiting on goroutines stuck in spanSet.pop:

(dlv) grs
Goroutine 2 - User: /usr/local/go/src/runtime/sys_linux_amd64.s:555 runtime.futex (0x23d0043) (thread 35209)
(dlv) grs -s
Goroutine 2 - Start: /usr/local/go/src/runtime/proc.go:296 runtime.forcegchelper (0x23982a0) (thread 35209)
(dlv) gr 2
(dlv) bt
0  0x00000000023d0043 in runtime.futex
   at /usr/local/go/src/runtime/sys_linux_amd64.s:555
1  0x000000000239172f in runtime.futexsleep
   at /usr/local/go/src/runtime/os_linux.go:75
2  0x0000000002366905 in runtime.notetsleep_internal
   at /usr/local/go/src/runtime/lock_futex.go:202
3  0x0000000002366a54 in runtime.notetsleep
   at /usr/local/go/src/runtime/lock_futex.go:225
4  0x000000000239aa36 in runtime.stopTheWorldWithSema
   at /usr/local/go/src/runtime/proc.go:1298
5  0x00000000023cbf69 in runtime.systemstack
   at /usr/local/go/src/runtime/asm_amd64.s:496
6  0x00000000023cbf00 in runtime.systemstack_switch
   at /usr/local/go/src/runtime/asm_amd64.s:463
7  0x00000000023766dc in runtime.gcStart
   at /usr/local/go/src/runtime/mgc.go:665
8  0x0000000002398305 in runtime.forcegchelper
   at /usr/local/go/src/runtime/proc.go:311
9  0x00000000023ce121 in runtime.goexit
(dlv) p sched.stopwait
3
(dlv) grs -r -with curloc mspanset
  Goroutine 661 - Runtime: /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop (0x238ebc2) (thread 35212)
  Goroutine 662 - Runtime: /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop (0x238ebc2) (thread 35249)
  Goroutine 822 - Runtime: /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop (0x238ebc2) (thread 35210)
[3 goroutines]

All other goroutines appear to be sleeping, as would be expected at this stage in stopTheWorldWithSema.
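
Here sched.stopwait == 3 matches the three goroutines above, each still running on its own thread: stopwait counts the Ps that have not yet stopped, and stopTheWorldWithSema sleeps on a note until the last of them decrements it to zero. The wait loop looks roughly like this (paraphrased from go1.20's proc.go, not verbatim), which is why three Ps stuck spinning in spanSet.pop keep the STW phase from ever completing:

// Paraphrase of the wait in stopTheWorldWithSema (go1.20 proc.go, around the
// line in frame 4 above): sched.stopwait counts Ps that still have to stop;
// each stopping P decrements it, and the last one wakes sched.stopnote.
for {
	// Sleep for up to 100us, then try to preempt the remaining Ps again.
	if notetsleep(&sched.stopnote, 100*1000) {
		noteclear(&sched.stopnote)
		break
	}
	preemptall()
}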

In another case, we found a gcBgMarkWorker stuck on the same issue:

(dlv) grs
  Goroutine 63 - User: /usr/local/go/src/runtime/sys_linux_amd64.s:555 runtime.futex (0x23d0043) (thread 3799) [unknown wait reason 30]
(dlv) grs -s
  Goroutine 63 - Go: /usr/local/go/src/runtime/mgc.go:1199 runtime.gcBgMarkStartWorkers (0x2377be5) (thread 3799) [unknown wait reason 30]
(dlv) gr 63
Switched from 645 to 63 (thread 3799)
(dlv) bt
 0  0x00000000023d0043 in runtime.futex
    at /usr/local/go/src/runtime/sys_linux_amd64.s:555
 1  0x000000000239172f in runtime.futexsleep
    at /usr/local/go/src/runtime/os_linux.go:75
 2  0x0000000002366905 in runtime.notetsleep_internal
    at /usr/local/go/src/runtime/lock_futex.go:202
 3  0x0000000002366a54 in runtime.notetsleep
    at /usr/local/go/src/runtime/lock_futex.go:225
 4  0x000000000239b42b in runtime.forEachP
    at /usr/local/go/src/runtime/proc.go:1718
 5  0x00000000023c4e07 in runtime.gcMarkDone.func1
    at /usr/local/go/src/runtime/mgc.go:814
 6  0x00000000023cbf69 in runtime.systemstack
    at /usr/local/go/src/runtime/asm_amd64.s:496
 7  0x00000000023cbf00 in runtime.systemstack_switch
    at /usr/local/go/src/runtime/asm_amd64.s:463
 8  0x0000000002376b1f in runtime.gcMarkDone
    at /usr/local/go/src/runtime/mgc.go:807
 9  0x0000000002377f85 in runtime.gcBgMarkWorker
    at /usr/local/go/src/runtime/mgc.go:1407
10  0x00000000023ce121 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1598

This case manifested as a memory leak: parts of the program continued to run and allocate memory, but the GC made no progress, waiting on goroutines stuck in the same place:

(dlv) p runtime.sched.safePointWait
2
(dlv) p runtime.allp[14].runSafePointFn
1
(dlv) p runtime.allp[21].runSafePointFn
1
(dlv) p (*runtime.m)(runtime.allp[14].m).procid
80708
(dlv) p (*runtime.m)(runtime.allp[21].m).procid
5220
(dlv) threads
  Thread 5220 at 0x238ebc2 /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop
  Thread 80708 at 0x238ebc2 /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop
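
Here runtime.sched.safePointWait == 2 matches the two Ps whose runSafePointFn flag is still set, and the threads backing those Ps (5220 and 80708) are the ones spinning in spanSet.pop. forEachP waits for that counter to drain in much the same way as the STW path above; roughly (paraphrased from go1.20's proc.go, not verbatim):

// Paraphrase of the wait in forEachP (go1.20 proc.go, around the line in
// frame 4 of the backtrace above): after asking every running P to run fn at
// its next safe point, sleep on sched.safePointNote until safePointWait
// reaches zero.
for {
	// Sleep for up to 100us, then re-preempt in case of races.
	if notetsleep(&sched.safePointNote, 100*1000) {
		noteclear(&sched.safePointNote)
		break
	}
	preemptall()
}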

We were able to confirm that the GC was indeed deadlocked, rather than the snapshot simply catching a transient state at an unlucky moment.

Monotonic time when the GC was started:

(dlv) frame 9
Frame 9: /usr/local/go/src/runtime/mgc.go:1407 (PC: 2377f85)
(dlv) p startTime - time.startNano
423405108916860

A timestamp from a goroutine that triggered the spanSet.pop call provides a reference time:

(dlv) thread 5220
(dlv) frame 9
> runtime.(*spanSet).pop() /usr/local/go/src/runtime/mspanset.go:194 (PC: 0x238ebc2)
Warning: debugging optimized function
Frame 9: /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/aggregator/aggregator.go:594 (PC: 56aea45)
(dlv) args
agg = ("*github.com/DataDog/datadog-agent/pkg/aggregator.BufferedAggregator")(0xc000ede420)
start = time.Time(2024-06-28T08:56:12-04:00, +423405107637349){wall: 13949841138539087330, ext: 423405107637349, loc: (*time.Location)(0xbb67ae0)}
(dlv) p 423405108916860 - 423405107637349
1279511

And from another part of the program that was not blocked by the GC, we can establish the current time:

(dlv) p "github.com/DataDog/datadog-agent/pkg/status/health".readinessAndLivenessCatalog.latestRun
time.Time(2024-07-01T21:59:12Z, +715185107371868){
        wall: 13950154434928228569,
        ext: 715185107371868,
(dlv) p (715185107371868 - 423405107637349) / 1e9 / 86400
3.3770833302606365

In this case, the GC appears to have been stuck for over 3 days.
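
The same arithmetic in Go, with the monotonic nanosecond values copied from the dlv output above (a minimal sketch; the durations are relative to the process's monotonic clock):

package main

import (
	"fmt"
	"time"
)

func main() {
	gcStart := time.Duration(423405108916860)   // mark start, from the gcBgMarkWorker frame
	refTime := time.Duration(423405107637349)   // timestamp held by the goroutine stuck in spanSet.pop
	latestRun := time.Duration(715185107371868) // latest health-check run, i.e. roughly "now"

	fmt.Println(gcStart - refTime)   // 1.279511ms: the GC cycle began just after the reference timestamp
	fmt.Println(latestRun - gcStart) // ~81h: the cycle has been running for more than 3 days
}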

What did you expect to see?

spanSet.pop executes quickly and doesn’t block the GC.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 10, 2024
@felixge
Contributor Author

felixge commented Jul 10, 2024

cc @mknyszek: We're still continuing our investigation on this. Feel free to ignore it since it's for go1.20.x, but I figured we should raise it on the off-chance that you know of any GC locking issues that have been fixed in go1.21+, or can give some advice on how we could rule out cgo-related heap corruption. 🙇

@gabyhelp good bot! I looked through all of those, and they seem unrelated. That being said, I noticed that #61428 talks about contention in runtime.(*spanSet).pop so I'm adding it to the list of potentially (but probably not) related issues.

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 10, 2024
@cagedmantis cagedmantis added this to the Go1.21.13 milestone Jul 10, 2024
@felixge
Contributor Author

felixge commented Jul 18, 2024

Our investigation is increasingly pointing to a kernel bug as the root cause here. Other processes seem to be failing on the same host as well. I'm leaving this open until we have fully confirmed it, but at this point I don't think this needs to be investigated by the Go team.

@mknyszek mknyszek modified the milestones: Go1.21.13, Backlog Jul 30, 2024
@felixge
Contributor Author

felixge commented Aug 5, 2024

Our investigation has confirmed that this was caused by the aforementioned kernel bug. Closing this. Sorry for the noise.
