runtime: solaris/amd64 crash in garbage collector #7554
Here is a similar crash on linux/race builder:

SIGSEGV: segmentation violation
PC=0x4070f9

goroutine 0 [idle]:
scanblock(0x7f40852df000, 0x7f4084aabc00)
	src/pkg/runtime/mgc0.c:948 +0x389
markroot(0xc208016480, 0x2b0000000b)
	src/pkg/runtime/mgc0.c:1344 +0xd9
runtime.parfordo(0xc208016480)
	src/pkg/runtime/parfor.c:88 +0xa3
gc(0x7f40842abc80)
	src/pkg/runtime/mgc0.c:2322 +0x183

http://build.golang.org/log/370a1f515e9904f35f8b975c85253b0ab9472efa

The line is:

	case GC_APTR:
>>>		obj = *(void**)(stack_top.b + pc[1]);
		pc += 2;
		break;

Labels changed: added release-go1.3.
Status changed to Accepted.
Interesting. Crashes reliably on loaded Solaris with that many CPUs. Could be port's fault, but then again the crash in http://build.golang.org/log/370a1f515e9904f35f8b975c85253b0ab9472efa is really similar; stack_top.b is nil.
Ok, more info. I re-tested on a different 24-way machine. This machine was idle, had 80GB RAM. I got no failures, everything works smoothly. It fails only on that other machine. The difference between the machines is that the failing one has various soft quotas regarding cpu/memory/io, perhaps most importantly, it has a really low quota on memory utilisation.
If stack_top.b is zero, then you pulled a zero out of the work buffers (see "Fetch b from the work buffer"). Why is that happening? There aren't supposed to be nils in the work buffer. In fact everything in the work buffer is supposed to be in the range [arena_start, arena_end). Are those set correctly? Add checks to the code that inserts pointers into the work buffer to make sure nil is not being inserted. If it is, why? If it is not, why do we find a nil when we read back from the buffer? Is the memory being zeroed halfway through the collection somehow?
I don't have any suggestions significantly better than what Russ said. I would start by assuming that we put NULL into workbuf (and that it's not zeroed later). There are a number of places where we insert objects into workbuf without explicit checks, assuming that the pointer is not NULL. In particular the enqueue1 function for roots, e.g.:

	enqueue1(&wbuf, (Obj){p, s->elemsize, 0});
	enqueue1(&wbuf, (Obj){(void*)&spf->fn, PtrSize, 0});
	enqueue1(&wbuf, (Obj){(void*)&spf->fint, PtrSize, 0});
	enqueue1(&wbuf, (Obj){(void*)&spf->ot, PtrSize, 0});

I would add NULL checks into the enqueue1/enqueue/flushobjbuf/flushptrbuf functions. If the check fires, then insert checks earlier, and so on.
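For illustration, the kind of check suggested in the two comments above could look roughly like the sketch below. It is not the runtime's code: Obj, Workbuf, Arena, EnqStatus and checked_enqueue are hypothetical stand-ins written for this example, and the real enqueue1 in mgc0.c has a different signature. The arena test follows the claim that everything in the work buffer should lie in [arena_start, arena_end); whether that holds for every root enqueued by markroot is an assumption this sketch does not verify.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the runtime's types; NOT the real mgc0.c definitions. */
typedef struct Obj {
    void     *p;   /* start of the object to scan */
    uintptr_t n;   /* size in bytes */
    uintptr_t ti;  /* type info, 0 if unknown */
} Obj;

enum { WORKBUF_CAP = 64 };  /* assumed capacity, for illustration only */

typedef struct Workbuf {
    size_t nobj;
    Obj    obj[WORKBUF_CAP];
} Workbuf;

/* Assumed heap bounds, playing the role of arena_start / arena_end. */
typedef struct Arena {
    uintptr_t start;
    uintptr_t end;
} Arena;

typedef enum { ENQ_OK, ENQ_NIL, ENQ_OUT_OF_ARENA, ENQ_FULL } EnqStatus;

/*
 * Validate an object before it goes into the work buffer, so that a bad
 * pointer is reported at the insertion site instead of crashing later
 * when scanblock reads it back.
 */
static EnqStatus checked_enqueue(Workbuf *wbuf, const Arena *arena, Obj obj)
{
    uintptr_t addr = (uintptr_t)obj.p;

    if (obj.p == NULL) {
        fprintf(stderr, "enqueue: nil object pointer (n=%lu)\n",
                (unsigned long)obj.n);
        return ENQ_NIL;
    }
    if (addr < arena->start || addr >= arena->end) {
        fprintf(stderr, "enqueue: %p is outside the arena\n", obj.p);
        return ENQ_OUT_OF_ARENA;
    }
    if (wbuf->nobj >= WORKBUF_CAP)
        return ENQ_FULL;  /* the real code would fetch a fresh buffer here */

    wbuf->obj[wbuf->nobj++] = obj;
    return ENQ_OK;
}
```

In a debugging build the failing branches would more likely abort than return a status, so that the stack trace points at the caller that produced the nil. If the check never fires, the remaining explanation is the one raised earlier: the buffer contents are being zeroed after insertion.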
Comment 18 by [email protected]: The issue can be mitigated by creating a processor set and building/running go on it:

	psrset -c        [assuming you don't have any, a processor set 1 will be created]
	psrset -a 1 0
	psrset -a 1 1
	psrset -e 1 <some go program>
Note that this failure only occurs in the runtime tests and we have never seen it occur in practice. Also, if you only want to build Go without running the tests, you can run make.bash instead of all.bash. Thanks for the tip regarding psrset.
Owner changed to @4ad.
zheganin, I've seen similar symptoms on a Joyent SmartOS machine which presents as having something like 24 cores, but only a few GB of RAM (and presumably little overcommit). I've always assumed the failure was related not to a concurrency issue, but to a memory allocation failure (due to quotas) when possibly 24 go test processes start concurrently.
This issue was closed.