Not planned
Description
#!watchflakes
default <- pkg == "runtime/pprof" && test == "TestProfilerStackDepth/mutex"
Issue created automatically to collect these failures.
Example (log):
=== RUN TestProfilerStackDepth/mutex
panic: test timed out after 3m0s
running tests:
TestProfilerStackDepth (2m44s)
TestProfilerStackDepth/mutex (2m43s)
goroutine 26723 gp=0x1d158c8 m=14 mp=0x1a80508 [running]:
panic({0x5ebd60, 0x192e030})
C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:804 +0x137 fp=0x1b54f84 sp=0x1b54f30 pc=0x4b00c7
testing.(*M).startAlarm.func1()
...
runtime/pprof.TestProfilerStackDepth.func2(0x18024b8)
C:/b/s/w/ir/x/w/goroot/src/runtime/pprof/pprof_test.go:2472 +0x80 fp=0x1857fb8 sp=0x1857eec pc=0x5c2ba0
testing.tRunner(0x18024b8, 0x18cabe8)
C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1690 +0xc7 fp=0x1857fe4 sp=0x1857fb8 pc=0x534177
testing.(*T).Run.gowrap1()
C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x2c fp=0x1857ff0 sp=0x1857fe4 pc=0x53512c
runtime.goexit({})
C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1857ff4 sp=0x1857ff0 pc=0x4b6f51
created by testing.(*T).Run in goroutine 26684
C:/b/s/w/ir/x/w/goroot/src/testing/testing.go:1743 +0x3bb
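For context, here is a minimal sketch of the kind of thing this subtest exercises. This is an approximation based on the test name and the stack above, not the actual pprof_test.go source: set the mutex profile sampling rate to 1, produce a contention event, then serialize the mutex profile, which is the step that allocates and (per the analysis comments below) can end up in a GC assist.

package main

import (
	"bytes"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// produceContention generates one mutex contention event: a second
// goroutine blocks on mu until the caller releases it.
func produceContention() {
	var mu sync.Mutex
	mu.Lock()
	done := make(chan struct{})
	go func() {
		mu.Lock() // blocks here; this is the recorded contention
		mu.Unlock()
		close(done)
	}()
	time.Sleep(10 * time.Millisecond) // give the goroutine time to block
	mu.Unlock()
	<-done
}

func main() {
	// Record every contention event, as the stack-depth test does.
	old := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(old)

	produceContention()

	// Serializing the profile allocates; if a GC cycle is active, this is
	// roughly where the failing goroutine would be pulled into mark assists
	// (the profile write -> malloc -> GC assist path described below). The
	// real test goes on to parse the profile and check the recorded stack
	// depth.
	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		panic(err)
	}
}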
Activity
gopherbot commented on Jul 24, 2024
Found new dashboard test flakes for:
2024-07-23 21:29 gotip-windows-386 go@c18ff292 runtime/pprof.TestProfilerStackDepth/mutex [ABORT] (log)
— watchflakes
mknyszek commented on Jul 31, 2024
I spent some time looking into this failure, and it's clearly some kind of GC deadlock. Notably, the goroutine trying to write out the mutex profile calls into malloc, then into GC assists, and gets stuck somewhere in assists. Unfortunately it's unclear where. Some candidates include:
The lack of any smoking guns in the stack dump suggests to me that sysmon may be involved. Or, it's some kind of memory corruption. I also wonder if this is a platform-specific failure; windows-386 is a bit funky.
A final note is that all the GC mark workers in the stack dump are parked. This kinda makes sense -- the GC is not done, but they can't find any work to do because in 2 minutes they almost certainly found everything they could aside from wherever that one goroutine is stuck.
mknyszek commented on Jul 31, 2024
Actually, the assist queue lock theory is not possible. We know for a fact we're in gcAssistAlloc1 since we're on the system stack and the line numbers match up for us to have entered the system stack there.
So, the likely explanation is difficulty preempting some goroutine (I can't think of any other reason we'd block so long...).
mknyszek commented on Jul 31, 2024
I have a new possible lead/theory. This test is somewhat special in that the mutex profile rate is 1. What if there's a place where we're acquiring and/or releasing a runtime lock, somewhere deep in gcDrainN, where we end up recording that lock in the mutex profile and blocking somewhere in there? I don't immediately see why that would cause a deadlock, but since this is a new kind of failure, recent changes (such as in the Go 1.23 cycle) may be relevant.
Though, this doesn't really line up with the evidence, since there should be some other goroutine waiting on a lock. Maybe it's a self deadlock? The write-mutex-profile code is interacting with the same locks that the runtime unlock path is, though I don't immediately see how we could self-deadlock in exactly the place we're stuck. Even if we were holding a lock (which it doesn't look like we are), that should prevent recording further locks.
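As a side note on the rate-1 point above, here is a small sketch of what SetMutexProfileFraction(1) means for user code: every contention event is sampled rather than 1 in N. Whether contention on runtime-internal locks reached from GC work is also recorded at that rate is the runtime-internal part of the theory and isn't something user code can demonstrate directly.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// contendOnce produces a single contention event on a mutex.
func contendOnce() {
	var mu sync.Mutex
	mu.Lock()
	done := make(chan struct{})
	go func() {
		mu.Lock() // blocks until the unlock below
		mu.Unlock()
		close(done)
	}()
	time.Sleep(time.Millisecond) // give the goroutine time to block
	mu.Unlock()
	<-done
}

func main() {
	runtime.SetMutexProfileFraction(1) // rate 1: sample every contention event

	for i := 0; i < 10; i++ {
		contendOnce()
	}

	// With rate 1, the accumulated records should account for (roughly) all
	// ten events; with the default rate of 0, none would be recorded.
	records := make([]runtime.BlockProfileRecord, 64)
	n, _ := runtime.MutexProfile(records)
	total := int64(0)
	for _, r := range records[:n] {
		total += r.Count
	}
	fmt.Printf("mutex profile: %d records, %d sampled events\n", n, total)
}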
mknyszek commented on Aug 8, 2024
In the last triage meeting, I said we'd just wait for more failures. So, setting the WaitingForInfo label.
gopherbot commented on Sep 8, 2024
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)