-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: fatal error: stopTheWorld: not stopped #17442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The linked issue (dvyukov/go-fuzz#145) suggests that the program has a data race. Aren't all bets off if there is a data race? |
@quentinmit The crashing process has races on integers. @rsc says races on integers are safe. |
Yes: races on integers are not supposed to crash a program. |
I was able to reproduce this in about 10 minutes at tip (96414ca) using the following: cd 17442
GOPATH=$PWD
go get -d github.com/dvyukov/go-fuzz
cd src/github.com/dvyukov/go-fuzz
git checkout 430caef0d9e2de9993a25311ada71d9857cf9f80
sed -i -e 's,// runtime,runtime,' examples/test/test.go
go install ./go-fuzz ./go-fuzz-build
$GOPATH/bin/go-fuzz-build github.com/dvyukov/go-fuzz/examples/test
# Get core files from crashes.
echo 'core.%p' | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
export GOTRACEBACK=crash
$GOPATH/bin/go-fuzz -bin test-fuzz.zip -workdir examples/test I applied the following patch so I could match crashes with core files: diff --git a/go-fuzz/testee.go b/go-fuzz/testee.go
index c918632..be3f0d3 100644
--- a/go-fuzz/testee.go
+++ b/go-fuzz/testee.go
@@ -114,6 +114,7 @@ func (bin *TestBinary) test(data []byte) (res int, ns uint64, cover, sonar, outp
hdr := fmt.Sprintf("program hanged (timeout %v seconds)\n\n", *flagTimeout)
output = append([]byte(hdr), output...)
}
+ output = append(output, fmt.Sprintf("\nPID: %d\n", bin.testee.cmd.Process.Pid)...)
bin.testee = nil
return
} To debug the core file: unzip test-fuzz.zip
gdb cover.exe core.$BADPID I haven't dug into the core yet. |
I believe this is a race between freezetheworld and stopTheWorldWithSema. stopTheWorldWithSema is panicking because it expects |
Further evidence: in the core I have, both M 0 and M 5 have a non-zero "dying". stopTheWorldWithSema is running on M 5 from goroutine 1. It must have come in second because it's blocked acquiring paniclk in startpanic_m. This also indicates that the bad stopwait value isn't because stopTheWorldWithSema itself panicked, since it hasn't reached the freezetheworld yet. M 5's stack trace
Strangely, M 0, despite having a non-zero dying, is just in notesleep in stopm. |
I figured out a little more. The racing panic is specifically caused by a SIGABRT from the test harness (when the test times out), which enters the panic path of SIGABRT. I haven't figured out the exact circumstances of the race, but this could explain why I couldn't reproduce this with a normal panic.
|
The following program reproduces this for me in around a second. It starts a subprocess that gets multiple Ms running, one of which repeatedly stops the world. It then sends it a SIGABRT. It's important that there be multiple Ms and that the one that receives the signal be different from the one doing the STW. It's not too sensitive to the timing of the SIGABRT, but the timing I have below worked reliably on my laptop. The timeout handling in the test isn't necessary for reproduction, but sometimes the child gets wedged (I think for unrelated reasons), so it's nice to have. package main
import (
"bytes"
"fmt"
"log"
"math/rand"
"os"
"os/exec"
"runtime"
"strings"
"syscall"
"time"
)
func main() {
if len(os.Args) > 1 && os.Args[1] == "-sub" {
go func() {
for {
runtime.GC()
}
}()
for {
time.Now()
}
}
self, err := os.Executable()
if err != nil {
log.Fatal(err)
}
for {
var buf bytes.Buffer
cmd := exec.Command(self, "-sub")
cmd.Stdout = &buf
cmd.Stderr = &buf
if err := cmd.Start(); err != nil {
log.Fatal(err)
}
time.Sleep(2*time.Millisecond + time.Duration(rand.Intn(int(5*time.Millisecond))))
cmd.Process.Signal(syscall.SIGABRT)
done := make(chan bool)
dump := false
go func() {
for {
t := time.NewTimer(time.Second)
select {
case <-done:
t.Stop()
return
case <-t.C:
fmt.Println("timed out; sending KILL")
dump = true
cmd.Process.Signal(syscall.SIGKILL)
}
}
}()
cmd.Wait()
done <- true
fail := strings.Contains(buf.String(), "not stopped")
if dump || fail {
fmt.Println(buf.String())
if fail {
break
}
}
}
} |
So, at this point, the specific sequence of events is pretty clear:
I'm inclined to fix this by having @ianlancetaylor, are there maybe cleaner ways to solve this in the signal handler? |
From what you are saying, the same problem would happen if one goroutine calls panic, the panic is not recovered, and the goroutine is about to call Since |
That can't happen because the panicking goroutine will block the STW, so stopTheWorldWithSema will never get to the point where it observes the bad stopwait value. The problem with signals is that they can initiate a panic on an already "stopped" M, so unlike a normal panic, they can race with STW. |
Ah, I see. I can't think of anything clever we could do in the signal handler. We need it to crash the program. I suppose we could skip the call to |
CL https://golang.org/cl/34611 mentions this issue. |
go version devel +d13fa4d Fri Oct 14 03:58:15 2016 +0000 linux/amd64
I am seeing the following crashes:
Full crash message:
https://gist.github.com/dvyukov/7968c10af421d0ef5c5dcc6a9cdfff43
To reproduce:
check out github.com/dvyukov/go-fuzz on 430caef0d9e2de9993a25311ada71d9857cf9f80
apply:
from go-fuzz dir:
go install ./go-fuzz ./go-fuzz-build
go-fuzz-build github.com/dvyukov/go-fuzz/examples/test
go-fuzz -bin test-fuzz.zip -workdir examples/test
after some time the crash report appears in examples/test/crashers/*.output
if go-fuzz crashes with a panic, restart it, it crashes for unrelated reason
The text was updated successfully, but these errors were encountered: