runtime: potential gc deadlock #68373

Closed
felixge opened this issue Jul 10, 2024 · 4 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@felixge
Contributor

felixge commented Jul 10, 2024

Note: I'm posting this issue on behalf of a colleague who did the analysis below.

Go version: 1.20.10.

Go env: N/A

But we know the following about the linux/amd64 VM this happens on:

OS: SLES12 (SUSE Linux Enterprise Server 12) SP3
Linux Kernel: 4.4 (custom built)
Core count: 80
Memory Size: 520 GB

We're seeing the same issue on a 96-core machine as well.

What did you do?

We are investigating an issue where the Go garbage collector appears to get deadlocked. The affected Go program either starts to use more and more memory, or just becomes inoperable.

We collected debugging information and core dumps from the affected processes.

The process is using CGO extensively, and we haven't ruled out heap corruption or possible hardware issues yet, but we are sharing this early in case this seems familiar to anyone.

What did you see happen?

Investigating the reports from a customer, we found that the LastGC value (as reported by memstats published via expvar) can be hours behind the current time, and that NextGC can be far below HeapAlloc. For example, from a snapshot taken on Wed Jun 19 20:34:24 UTC 2024:

HeapAlloc: 2.9148435376e+10
NextGC: 7.6087616e+07
LastGC: 1.7187810082630615e+18  # Wed Jun 19 07:10:08 UTC 2024
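
For reference, MemStats.LastGC is reported as nanoseconds since the Unix epoch, so the snapshot value decodes to the timestamp noted above. A minimal sketch of the conversion (the literal is the expvar float written out as integer nanoseconds):

package main

import (
	"fmt"
	"time"
)

func main() {
	// LastGC from the expvar snapshot, nanoseconds since the Unix epoch.
	lastGC := int64(1718781008263061500)
	fmt.Println(time.Unix(0, lastGC).UTC()) // 2024-06-19 07:10:08.2630615 +0000 UTC
}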

We obtained some core dumps from the affected instances of the program and found that the GC in them was active but blocked, in different places, on goroutines executing the same tight loop in the runtime's spanSet.pop:

https://github.com/golang/go/blob/go1.20.10/src/runtime/mspanset.go#L189-L195
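
The code at those lines spins on an atomically published block pointer that a concurrent pusher is expected to set almost immediately, so a goroutine parked there indefinitely is highly unusual. A self-contained sketch of the pattern (using sync/atomic rather than the runtime's internal types):

package main

import (
	"fmt"
	"sync/atomic"
)

// block stands in for the runtime's spanSetBlock; the real type holds spans.
type block struct{ spans [512]uintptr }

func main() {
	var blockp atomic.Pointer[block]

	// The pusher publishes the block after bumping the spine length, ...
	go blockp.Store(new(block))

	// ... so pop expects to see a non-nil pointer almost immediately and
	// simply spins until it does. This mirrors the shape of the loop at
	// mspanset.go:189-195, where the stuck goroutines were found.
	b := blockp.Load()
	for b == nil {
		b = blockp.Load()
	}
	fmt.Println("popped block:", b != nil)
}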

One example, from a case where the GC's stop-the-world (STW) phase was waiting on goroutines stuck in spanSet.pop:

(dlv) grs
Goroutine 2 - User: /usr/local/go/src/runtime/sys_linux_amd64.s:555 runtime.futex (0x23d0043) (thread 35209)
(dlv) grs -s
Goroutine 2 - Start: /usr/local/go/src/runtime/proc.go:296 runtime.forcegchelper (0x23982a0) (thread 35209)
(dlv) gr 2
(dlv) bt
0  0x00000000023d0043 in runtime.futex
   at /usr/local/go/src/runtime/sys_linux_amd64.s:555
1  0x000000000239172f in runtime.futexsleep
   at /usr/local/go/src/runtime/os_linux.go:75
2  0x0000000002366905 in runtime.notetsleep_internal
   at /usr/local/go/src/runtime/lock_futex.go:202
3  0x0000000002366a54 in runtime.notetsleep
   at /usr/local/go/src/runtime/lock_futex.go:225
4  0x000000000239aa36 in runtime.stopTheWorldWithSema
   at /usr/local/go/src/runtime/proc.go:1298
5  0x00000000023cbf69 in runtime.systemstack
   at /usr/local/go/src/runtime/asm_amd64.s:496
6  0x00000000023cbf00 in runtime.systemstack_switch
   at /usr/local/go/src/runtime/asm_amd64.s:463
7  0x00000000023766dc in runtime.gcStart
   at /usr/local/go/src/runtime/mgc.go:665
8  0x0000000002398305 in runtime.forcegchelper
   at /usr/local/go/src/runtime/proc.go:311
9  0x00000000023ce121 in runtime.goexit
(dlv) p sched.stopwait
3
(dlv) grs -r -with curloc mspanset
  Goroutine 661 - Runtime: /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop (0x238ebc2) (thread 35212)
  Goroutine 662 - Runtime: /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop (0x238ebc2) (thread 35249)
  Goroutine 822 - Runtime: /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop (0x238ebc2) (thread 35210)
[3 goroutines]

All other goroutines appear to be sleeping, as would be expected at this stage in stopTheWorldWithSema.
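
Here sched.stopwait == 3 matches the three goroutines above, each still running on its own thread: stopwait counts the Ps that have not yet stopped, and stopTheWorldWithSema sleeps on a note until the last of them decrements it to zero. The wait loop looks roughly like this (paraphrased from go1.20's proc.go, not verbatim), which is why three Ps stuck spinning in spanSet.pop keep the STW phase from ever completing:

// Paraphrase of the wait in stopTheWorldWithSema (go1.20 proc.go, around the
// line in frame 4 above): sched.stopwait counts Ps that still have to stop;
// each stopping P decrements it, and the last one wakes sched.stopnote.
for {
	// Sleep for up to 100us, then try to preempt the remaining Ps again.
	if notetsleep(&sched.stopnote, 100*1000) {
		noteclear(&sched.stopnote)
		break
	}
	preemptall()
}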

In another case, we found a gcBgMarkWorker stuck on the same issue:

(dlv) grs
  Goroutine 63 - User: /usr/local/go/src/runtime/sys_linux_amd64.s:555 runtime.futex (0x23d0043) (thread 3799) [unknown wait reason 30]
(dlv) grs -s
  Goroutine 63 - Go: /usr/local/go/src/runtime/mgc.go:1199 runtime.gcBgMarkStartWorkers (0x2377be5) (thread 3799) [unknown wait reason 30]
(dlv) gr 63
Switched from 645 to 63 (thread 3799)
(dlv) bt
 0  0x00000000023d0043 in runtime.futex
    at /usr/local/go/src/runtime/sys_linux_amd64.s:555
 1  0x000000000239172f in runtime.futexsleep
    at /usr/local/go/src/runtime/os_linux.go:75
 2  0x0000000002366905 in runtime.notetsleep_internal
    at /usr/local/go/src/runtime/lock_futex.go:202
 3  0x0000000002366a54 in runtime.notetsleep
    at /usr/local/go/src/runtime/lock_futex.go:225
 4  0x000000000239b42b in runtime.forEachP
    at /usr/local/go/src/runtime/proc.go:1718
 5  0x00000000023c4e07 in runtime.gcMarkDone.func1
    at /usr/local/go/src/runtime/mgc.go:814
 6  0x00000000023cbf69 in runtime.systemstack
    at /usr/local/go/src/runtime/asm_amd64.s:496
 7  0x00000000023cbf00 in runtime.systemstack_switch
    at /usr/local/go/src/runtime/asm_amd64.s:463
 8  0x0000000002376b1f in runtime.gcMarkDone
    at /usr/local/go/src/runtime/mgc.go:807
 9  0x0000000002377f85 in runtime.gcBgMarkWorker
    at /usr/local/go/src/runtime/mgc.go:1407
10  0x00000000023ce121 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1598

This case manifested as a memory leak: parts of the program continued to run and allocate memory, but the GC made no progress, waiting on goroutines stuck in the same place:

(dlv) p runtime.sched.safePointWait
2
(dlv) p runtime.allp[14].runSafePointFn
1
(dlv) p runtime.allp[21].runSafePointFn
1
(dlv) p (*runtime.m)(runtime.allp[14].m).procid
80708
(dlv) p (*runtime.m)(runtime.allp[21].m).procid
5220
(dlv) threads
  Thread 5220 at 0x238ebc2 /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop
  Thread 80708 at 0x238ebc2 /usr/local/go/src/runtime/mspanset.go:194 runtime.(*spanSet).pop
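
Here runtime.sched.safePointWait == 2 matches the two Ps whose runSafePointFn flag is still set, and the threads backing those Ps (5220 and 80708) are the ones spinning in spanSet.pop. forEachP waits for that counter to drain in much the same way as the STW path above; roughly (paraphrased from go1.20's proc.go, not verbatim):

// Paraphrase of the wait in forEachP (go1.20 proc.go, around the line in
// frame 4 of the backtrace above): after asking every running P to run fn at
// its next safe point, sleep on sched.safePointNote until safePointWait
// reaches zero.
for {
	// Sleep for up to 100us, then re-preempt in case of races.
	if notetsleep(&sched.safePointNote, 100*1000) {
		noteclear(&sched.safePointNote)
		break
	}
	preemptall()
}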

We were able to confirm that the GC was indeed deadlocked, rather than the snapshot simply catching a transient state at an unlucky moment.

Monotonic time when the GC was started:

(dlv) frame 9
Frame 9: /usr/local/go/src/runtime/mgc.go:1407 (PC: 2377f85)
(dlv) p startTime - time.startNano
423405108916860

A timestamp from a goroutine that triggered the spanSet.pop call provides a reference time:

(dlv) thread 5220
(dlv) frame 9
> runtime.(*spanSet).pop() /usr/local/go/src/runtime/mspanset.go:194 (PC: 0x238ebc2)
Warning: debugging optimized function
Frame 9: /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/aggregator/aggregator.go:594 (PC: 56aea45)
(dlv) args
agg = ("*github.com/DataDog/datadog-agent/pkg/aggregator.BufferedAggregator")(0xc000ede420)
start = time.Time(2024-06-28T08:56:12-04:00, +423405107637349){wall: 13949841138539087330, ext: 423405107637349, loc: (*time.Location)(0xbb67ae0)}
(dlv) p 423405108916860 - 423405107637349
1279511

And from another part of the program that was not blocked by the GC, we can establish the current time:

(dlv) p "github.com/DataDog/datadog-agent/pkg/status/health".readinessAndLivenessCatalog.latestRun
time.Time(2024-07-01T21:59:12Z, +715185107371868){
        wall: 13950154434928228569,
        ext: 715185107371868,
(dlv) p (715185107371868 - 423405107637349) / 1e9 / 86400
3.3770833302606365

In this case, the GC appears to have been stuck for over 3 days.
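
The same arithmetic in Go, with the monotonic nanosecond values copied from the dlv output above (a minimal sketch; the durations are relative to the process's monotonic clock):

package main

import (
	"fmt"
	"time"
)

func main() {
	gcStart := time.Duration(423405108916860)   // mark start, from the gcBgMarkWorker frame
	refTime := time.Duration(423405107637349)   // timestamp held by the goroutine stuck in spanSet.pop
	latestRun := time.Duration(715185107371868) // latest health-check run, i.e. roughly "now"

	fmt.Println(gcStart - refTime)   // 1.279511ms: the GC cycle began just after the reference timestamp
	fmt.Println(latestRun - gcStart) // ~81h: the cycle has been running for more than 3 days
}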

What did you expect to see?

spanSet.pop executes quickly and doesn’t block the GC.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 10, 2024
@felixge
Contributor Author

felixge commented Jul 10, 2024

cc @mknyszek: We're still continuing our investigation on this. Feel free to ignore it since it's for go1.20.x, but I figured we should raise it on the off-chance that you know of any GC locking issues that have been fixed in go1.21+, or can give some advice on how we could rule out cgo-related heap corruption. 🙇

@gabyhelp good bot! I looked through all of those, and they seem unrelated. That being said, I noticed that #61428 talks about contention in runtime.(*spanSet).pop so I'm adding it to the list of potentially (but probably not) related issues.

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 10, 2024
@cagedmantis cagedmantis added this to the Go1.21.13 milestone Jul 10, 2024
@felixge
Contributor Author

felixge commented Jul 18, 2024

Our investigation is increasingly pointing to a kernel bug as the root cause here. Other processes seem to be failing on the same host as well. I'm leaving this open until we have fully confirmed it, but at this point I don't think this needs to be investigated by the Go team.

@mknyszek mknyszek modified the milestones: Go1.21.13, Backlog Jul 30, 2024
@felixge
Contributor Author

felixge commented Aug 5, 2024

Our investigation has confirmed that this was caused by the aforementioned kernel bug. Closing this. Sorry for the noise.
