Skip to content

os: on NetBSD, *Process.Wait sometimes deadlocks after cmd.Process.Signal returns "process already finished" #48789

Closed
@bcmills

Description

@bcmills

I've noticed a recurring pattern in cmd/go test failures on NetBSD builders, and I believe that it indicates a bug in either os.Process or the kernel itself on that platform.

The cmd/go tests start processes running “in the background” using os/exec, and use a waitOrStop function to terminate any remaining processes at the conclusion of each test.

The waitOrStop function starts a goroutine, then blocks on cmd.Wait(). The background goroutine blocks until either the call to cmd.Wait completes or the Context is canceled, then sends a signal to the process. If that signal fails with os: process already finished, then we assume that the process actually has already finished, and the background goroutine simply blocks until the call to cmd.Wait (inevitably) returns.

That all happens here:

go/src/cmd/go/script_test.go

Lines 1164 to 1206 in a05a7d4

errc := make(chan error)
go func() {
select {
case errc <- nil:
return
case <-ctx.Done():
}
err := cmd.Process.Signal(interrupt)
if err == nil {
err = ctx.Err() // Report ctx.Err() as the reason we interrupted.
} else if err.Error() == "os: process already finished" {
errc <- nil
return
}
if killDelay > 0 {
timer := time.NewTimer(killDelay)
select {
// Report ctx.Err() as the reason we interrupted the process...
case errc <- ctx.Err():
timer.Stop()
return
// ...but after killDelay has elapsed, fall back to a stronger signal.
case <-timer.C:
}
// Wait still hasn't returned.
// Kill the process harder to make sure that it exits.
//
// Ignore any error: if cmd.Process has already terminated, we still
// want to send ctx.Err() (or the error from the Interrupt call)
// to properly attribute the signal that may have terminated it.
_ = cmd.Process.Kill()
}
errc <- err
}()
waitErr := cmd.Wait()
if interruptErr := <-errc; interruptErr != nil {
return interruptErr
}


What I'm seeing on (some of?) the NetBSD builders is that after cmd.Process.Signal fails with process already finished, the call to cmd.Wait continues to block, seemingly forever.

The relevant goroutine traces are:

goroutine 1536 [syscall, 8 minutes]:
syscall.Syscall6(0x42d7, 0x9117f20, 0x0, 0x8f1a5a0, 0x0, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_386.s:44 +0x5 fp=0x9117ecc sp=0x9117ec8 pc=0x80bc1e5
syscall.wait4(0x42d7, 0x9117f20, 0x0, 0x8f1a5a0)
	/tmp/workdir/go/src/syscall/zsyscall_netbsd_386.go:35 +0x5b fp=0x9117f04 sp=0x9117ecc pc=0x80b8f3b
syscall.Wait4(0x42d7, 0x9117f44, 0x0, 0x8f1a5a0)
	/tmp/workdir/go/src/syscall/syscall_bsd.go:145 +0x3b fp=0x9117f28 sp=0x9117f04 pc=0x80b6abb
os.(*Process).wait(0x8e98150)
	/tmp/workdir/go/src/os/exec_unix.go:44 +0xf2 fp=0x9117f60 sp=0x9117f28 pc=0x80e2762
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0x8c7bc30)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x4d fp=0x9117f9c sp=0x9117f60 pc=0x817eedd
cmd/go_test.waitOrStop({0x874073c, 0x8dc2300}, 0x8c7bc30, {0x873f89c, 0x869bd38}, 0x6472661e3)
	/tmp/workdir/go/src/cmd/go/script_test.go:1203 +0x11a fp=0x9117fc0 sp=0x9117f9c pc=0x85800ea
cmd/go_test.(*testScript).startBackground.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1143 +0x5f fp=0x9117ff0 sp=0x9117fc0 pc=0x857ff8f
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x9117ff4 sp=0x9117ff0 pc=0x80ab241
created by cmd/go_test.(*testScript).startBackground
	/tmp/workdir/go/src/cmd/go/script_test.go:1142 +0x390

goroutine 1537 [chan send, 8 minutes]:
runtime.gopark(0x869b798, 0x8dbc430, 0xf, 0x16, 0x2)
	/tmp/workdir/go/src/runtime/proc.go:366 +0xf6 fp=0x90e8f04 sp=0x90e8ef0 pc=0x807f376
runtime.chansend(0x8dbc400, 0x87302f0, 0x1, 0x8580314)
	/tmp/workdir/go/src/runtime/chan.go:258 +0x2f4 fp=0x90e8f48 sp=0x90e8f04 pc=0x804db54
runtime.chansend1(0x8dbc400, 0x87302f0)
	/tmp/workdir/go/src/runtime/chan.go:144 +0x24 fp=0x90e8f60 sp=0x90e8f48 pc=0x804d854
cmd/go_test.waitOrStop.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1176 +0x184 fp=0x90e8ff0 sp=0x90e8f60 pc=0x8580314
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x90e8ff4 sp=0x90e8ff0 pc=0x80ab241
created by cmd/go_test.waitOrStop
	/tmp/workdir/go/src/cmd/go/script_test.go:1165 +0x10e

Note that goroutine 1537 is blocked at script_test.go:1176, which is the send on errc after cmd.Process.Signal fails with os: process already finished.

Goroutine 1536 is blocked at the call to cmd.Wait, which is itself blocked on syscall.Wait4.


The failure rate with these symptoms is fairly high: something on the order of 20 failures per month.

greplogs --dashboard -md -l -e '(?m)panic: test timed out.*(?:.*\n)*.*\[syscall, .* minutes\]:\n(?:.+\n\t.+\n)*syscall\.Wait.*\n\t.+\n(?:.+\n\t.+\n)*cmd/go_test\.waitOrStop'

2021-10-04T22:46:23-17674e2/netbsd-386-9_0
2021-10-04T18:15:09-9f8d558/netbsd-386-9_0
2021-09-30T19:56:06-eb9f090/netbsd-386-9_0
2021-09-29T15:23:27-aeb4fba/netbsd-amd64-9_0
2021-09-28T17:18:36-ff7b041/netbsd-amd64-9_0
2021-09-28T15:26:21-583eeaa/netbsd-386-9_0
2021-09-27T18:57:20-3d795ea/netbsd-amd64-9_0
2021-09-22T16:24:17-74ba70b/netbsd-amd64-9_0
2021-09-22T15:00:53-91c2318/netbsd-amd64-9_0
2021-09-21T20:39:31-48cf96c/netbsd-386-9_0
2021-09-20T23:04:13-d7e3e44/netbsd-arm64-bsiegert
2021-09-20T00:13:47-a83a558/netbsd-amd64-9_0
2021-09-17T19:32:44-74e384f/netbsd-amd64-9_0
2021-09-16T23:57:40-8d2a9c3/netbsd-386-9_0
2021-09-16T19:38:19-bcdc61d/netbsd-amd64-9_0
2021-09-10T17:11:39-5a4b9f9/netbsd-amd64-9_0
2021-09-09T16:32:28-a53e3d5/netbsd-amd64-9_0
2021-09-08T11:57:03-9295723/netbsd-386-9_0
2021-09-07T03:56:13-6226020/netbsd-386-9_0
2021-09-04T10:58:11-5ec298d/netbsd-386-9_0
2021-08-31T16:43:46-6815235/netbsd-386-9_0
2021-08-30T22:07:49-b06cfe9/netbsd-386-9_0
2021-08-27T05:13:44-2c60a99/netbsd-386-9_0
2021-08-24T22:23:12-54cdef1/netbsd-386-9_0
2021-08-23T21:22:58-8157960/netbsd-386-9_0
2021-08-23T21:22:58-8157960/netbsd-arm64-bsiegert
2021-08-22T21:43:43-1958582/netbsd-386-9_0
2021-08-20T03:25:17-c92c2c9/netbsd-386-9_0
2021-08-19T20:50:13-65074a4/netbsd-amd64-9_0
2021-08-18T21:19:22-c2bd9ee/netbsd-386-9_0
2021-08-18T20:11:28-165ebd8/netbsd-386-9_0
2021-08-17T16:22:15-cf12b0d/netbsd-386-9_0
2021-08-17T15:00:04-3001b0a/netbsd-386-9_0
2021-08-17T04:37:32-a304273/netbsd-386-9_0
2021-08-17T01:29:37-1951afc/netbsd-386-9_0
2021-08-16T18:44:38-56a919f/netbsd-386-9_0
2021-08-16T18:44:32-ff36d11/netbsd-amd64-9_0
2021-08-16T13:38:52-a192ef8/netbsd-386-9_0
2021-08-12T17:43:16-39634e7/netbsd-386-9_0
2021-08-09T20:06:35-f1dce31/netbsd-386-9_0
2021-08-06T16:51:12-70546f6/netbsd-386-9_0
2021-07-28T03:27:13-b39e0f4/netbsd-386-9_0
2021-07-15T20:39:22-0941dbc/netbsd-amd64-9_0
2021-06-29T16:57:13-3463852/netbsd-386-9_0
2021-06-28T20:51:30-956c81b/netbsd-386-9_0
2021-06-24T03:45:33-a9bb382/netbsd-386-9_0
2021-06-15T20:59:42-d77f4c0/netbsd-amd64-9_0
2021-06-11T20:31:30-16b5d76/netbsd-386-9_0
2021-06-09T17:11:44-df35ade/netbsd-386-9_0
2021-06-09T15:09:13-139e935/netbsd-386-9_0
2021-05-21T17:43:46-4fda54c/netbsd-arm64-bsiegert
2021-05-21T17:35:47-8876b9b/netbsd-386-9_0
2021-05-17T16:02:12-b1aff42/netbsd-386-9_0
2021-05-12T15:23:09-0388670/netbsd-386-9_0
2021-05-12T02:04:57-1a0ea1a/netbsd-386-9_0
2021-05-11T18:22:54-9b84814/netbsd-386-9_0
2021-05-10T13:16:56-2870259/netbsd-386-9_0
2021-05-06T16:00:55-6c591f7/netbsd-386-9_0
2020-09-19T05:13:19-ccf581f/netbsd-386-9_0

Metadata

Metadata

Assignees

No one assigned

    Labels

    FrozenDueToAgeNeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.OS-NetBSD

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions