Skip to content

runtime: solaris/amd64 crash in garbage collector #7554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
4ad opened this issue Mar 15, 2014 · 20 comments
Closed

runtime: solaris/amd64 crash in garbage collector #7554

4ad opened this issue Mar 15, 2014 · 20 comments

Comments

@4ad
Copy link
Member

4ad commented Mar 15, 2014

runtime: gc crash in tests

We're seeing some crashes in the garbage collector while running runtime
tests on a particular Solaris machine, for example:

    http://build.golang.org/log/f80b1657689632e2f02fd3976f6b15edc4d130a6
    http://build.golang.org/log/09d748b886fe7e70d34f6e4bf58561a3ea5871b1

The machine exposes 24 CPUs but has little memory. On another machine
with 24 CPUs but with more memory I can still see the failures, but much
more rarely. On my 4 CPU machine I never see the failures.

If I disable the GC, the failure goes away.

This is possibly related to issue #7502 and issue #7159 (the crash seems
to occur only in tests that use reflection). It happens in various tests,
but it always fails in the same place.

    SIGSEGV: segmentation violation
    PC=0x409620
    
    goroutine 0 [idle]:
    scanblock(0x0, 0x1)
           /home/aram/go/src/pkg/runtime/mgc0.c:1023 +0xc80
    runtime.gchelper()
          /home/aram/go/src/pkg/runtime/mgc0.c:2044 +0x45
    stopm()
      /home/aram/go/src/pkg/runtime/proc.c:960 +0x111
    gcstopm()
            /home/aram/go/src/pkg/runtime/proc.c:1125 +0xc4
    schedule()
           /home/aram/go/src/pkg/runtime/proc.c:1322 +0x4a
    runtime.gosched0(0xc208044480)
       /home/aram/go/src/pkg/runtime/proc.c:1437 +0x9e
    runtime.newstack()
           /home/aram/go/src/pkg/runtime/stack.c:657 +0x330
    runtime.morestack()
         /home/aram/go/src/pkg/runtime/asm_amd64.s:228 +0x6a

The code in scanblock is:

    case GC_DEFAULT_PTR:
        while(stack_top.b <= end_b) {
            obj = *(byte**)stack_top.b;
>>>            stack_top.b += PtrSize;
            if(obj >= arena_start && obj < arena_used) {
                *sbuf.ptr.pos++ = (PtrTarget){obj, 0};
                if(sbuf.ptr.pos == sbuf.ptr.end)
                    flushptrbuf(&sbuf);
            }
        }
        goto next_block;

I have confirmed in the debugger that stack_top.b is zero.

Other frames related to the GC are:

    goroutine 18 [GC sweep wait]:
    runtime.park(0x415bf0, 0x856d68, 0x8534ac)
     /home/aram/go/src/pkg/runtime/proc.c:1370 +0x89
    runtime.parkunlock(0x856d68, 0x8534ac)
       /home/aram/go/src/pkg/runtime/proc.c:1386 +0x3b
    bgsweep()
            /home/aram/go/src/pkg/runtime/mgc0.c:1918 +0xb2
    runtime.goexit()
     /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime.gc
      /home/aram/go/src/pkg/runtime/mgc0.c:2187
    
    goroutine 388 [finalizer wait]:
    runtime.park(0x415bf0, 0x856d68, 0x853692)
          /home/aram/go/src/pkg/runtime/proc.c:1370 +0x89
    runtime.parkunlock(0x856d68, 0x853692)
       /home/aram/go/src/pkg/runtime/proc.c:1386 +0x3b
    runfinq()
            /home/aram/go/src/pkg/runtime/mgc0.c:2555 +0xc2
    runtime.goexit()
     /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime.gc
      /home/aram/go/src/pkg/runtime/mgc0.c:2187

    goroutine 52392 [garbage collection]:
    runtime.gc(0xfffffd7f00000000)
     /home/aram/go/src/pkg/runtime/mgc0.c:2250 +0x1b2
    runtime.mallocgc(0x20, 0x673820, 0x0)
       /home/aram/go/src/pkg/runtime/malloc.goc:206 +0x1b7
    runtime.new(0x673820, 0x8)
       /home/aram/go/src/pkg/runtime/malloc.goc:815 +0x32
    reflect.Value.MapIndex(0x5f5a20, 0xc208108060, 0x0, 0x150, 0x5f7560, ...)
         /home/aram/go/src/pkg/reflect/value.go:1165 +0x45
    runtime_test.func·044()
           /home/aram/go/src/pkg/runtime/map_test.go:272 +0x15d
    runtime.goexit()
        /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime_test.testConcurrentReadsAfterGrowth
     /home/aram/go/src/pkg/runtime/map_test.go:274 +0x26a

And frames related to reflection:

    goroutine 52398 [semacquire]:
    runtime.park(0x415bf0, 0x864b00, 0x84fac0)
         /home/aram/go/src/pkg/runtime/proc.c:1370 +0x89
    runtime.parkunlock(0x864b00, 0x84fac0)
       /home/aram/go/src/pkg/runtime/proc.c:1386 +0x3b
    runtime.semacquire(0x849294, 0x0)
            /home/aram/go/src/pkg/runtime/sema.goc:140 +0x10e
    runtime.gc(0xfffffd7f00000000)
     /home/aram/go/src/pkg/runtime/mgc0.c:2220 +0xfc
    runtime.mallocgc(0x20, 0x673820, 0x0)
        /home/aram/go/src/pkg/runtime/malloc.goc:206 +0x1b7
    runtime.new(0x673820, 0x8)
       /home/aram/go/src/pkg/runtime/malloc.goc:815 +0x32
    reflect.Value.MapIndex(0x5f5a20, 0xc208108060, 0x0, 0x150, 0x5f7560, ...)
         /home/aram/go/src/pkg/reflect/value.go:1165 +0x45
    runtime_test.func·044()
           /home/aram/go/src/pkg/runtime/map_test.go:272 +0x15d
    runtime.goexit()
        /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime_test.testConcurrentReadsAfterGrowth
     /home/aram/go/src/pkg/runtime/map_test.go:274 +0x26a

    goroutine 52914 [runnable]:
    reflect.loadScalar(0xc208142e38, 0x8, 0xf2)
          /home/aram/go/src/pkg/reflect/value.go:223
    reflect.Value.MapKeys(0x5f5a20, 0xc208108060, 0x0, 0x150, 0x0, ...)
       /home/aram/go/src/pkg/reflect/value.go:1242 +0x343
    runtime_test.func·044()
          /home/aram/go/src/pkg/runtime/map_test.go:270 +0xc3
    runtime.goexit()
         /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime_test.testConcurrentReadsAfterGrowth
     /home/aram/go/src/pkg/runtime/map_test.go:274 +0x26a
     
    goroutine 52890 [runnable]:
    reflect.mapiternext(0xc20810e190)
            /home/aram/go/src/pkg/runtime/hashmap.goc:1046
    reflect.Value.MapKeys(0x5f5a20, 0xc208108060, 0x0, 0x150, 0x0, ...)
           /home/aram/go/src/pkg/reflect/value.go:1244 +0x29c
    runtime_test.func·044()
          /home/aram/go/src/pkg/runtime/map_test.go:270 +0xc3
    runtime.goexit()
         /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime_test.testConcurrentReadsAfterGrowth
     /home/aram/go/src/pkg/runtime/map_test.go:274 +0x26a
    
    goroutine 52395 [runnable]:
    reflect.mapiternext(0xc20824c230)
            /home/aram/go/src/pkg/runtime/hashmap.goc:1046
    reflect.Value.MapKeys(0x5f5a20, 0xc208108060, 0x0, 0x150, 0x0, ...)
           /home/aram/go/src/pkg/reflect/value.go:1244 +0x29c
    runtime_test.func·044()
          /home/aram/go/src/pkg/runtime/map_test.go:270 +0xc3
    runtime.goexit()
         /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime_test.testConcurrentReadsAfterGrowth
     /home/aram/go/src/pkg/runtime/map_test.go:274 +0x26a
    
    goroutine 52917 [runnable]:
    reflect.loadScalar(0xc20826fb30, 0x8, 0x143)
         /home/aram/go/src/pkg/reflect/value.go:223
    reflect.Value.MapKeys(0x5f5a20, 0xc208108060, 0x0, 0x150, 0x0, ...)
       /home/aram/go/src/pkg/reflect/value.go:1242 +0x343
    runtime_test.func·044()
          /home/aram/go/src/pkg/runtime/map_test.go:270 +0xc3
    runtime.goexit()
         /home/aram/go/src/pkg/runtime/proc.c:1446
    created by runtime_test.testConcurrentReadsAfterGrowth
     /home/aram/go/src/pkg/runtime/map_test.go:274 +0x26a

Full stack trace can be found in the referenced build logs.
@4ad
Copy link
Member Author

4ad commented Mar 15, 2014

Comment 1:

I see this in a stack trace:
reflect.Value.assignTo(0x0, 0xdeaddeaddeaddead, 0xda, 0x20, 0x6b0f70, ...)

@4ad
Copy link
Member Author

4ad commented Mar 15, 2014

Comment 2:

Disabling precise stack *does not* alleviate the issue.

@dvyukov
Copy link
Member

dvyukov commented Mar 23, 2014

Comment 4:

Here is a similar crash on linux/race builder:
SIGSEGV: segmentation violation
PC=0x4070f9
goroutine 0 [idle]:
scanblock(0x7f40852df000, 0x7f4084aabc00)
    src/pkg/runtime/mgc0.c:948 +0x389
markroot(0xc208016480, 0x2b0000000b)
    src/pkg/runtime/mgc0.c:1344 +0xd9
runtime.parfordo(0xc208016480)
    src/pkg/runtime/parfor.c:88 +0xa3
gc(0x7f40842abc80)
    src/pkg/runtime/mgc0.c:2322 +0x183
http://build.golang.org/log/370a1f515e9904f35f8b975c85253b0ab9472efa
The line is:
        case GC_APTR:
>>>         obj = *(void**)(stack_top.b + pc[1]);
            pc += 2;
            break;

Labels changed: added release-go1.3.

Status changed to Accepted.

@4ad
Copy link
Member Author

4ad commented Mar 26, 2014

Comment 5:

This is made more obvious on the 24-way machine (it never happens on the 4-way machine).
Do you have a 24-way Linux machine to test?

@dvyukov
Copy link
Member

dvyukov commented Mar 26, 2014

Comment 6:

I have a 32-way linux machine, but I never seen such crashes. Any suggestions on how to
reproduce it?

@4ad
Copy link
Member Author

4ad commented Mar 26, 2014

Comment 7:

Just running GOMAXPROCS=n go test -cpu=1,2,4 -short runtime triggers it for me
(GOMAXPROCS=n is the crucial bit for me).

@4ad
Copy link
Member Author

4ad commented Mar 26, 2014

Comment 8:

Btw, my 24-way machine is quite loaded, it's not idle; I don't know if that's relevant.

@dvyukov
Copy link
Member

dvyukov commented Mar 26, 2014

Comment 9:

Just run
$ GOMAXPROCS=24 ./runtime.test -test.cpu=1,2,4 -test.short
300 times, no crashes.

@dvyukov
Copy link
Member

dvyukov commented Mar 26, 2014

Comment 10:

Also done 100 runs with race detector, no crashes.

@4ad
Copy link
Member Author

4ad commented Mar 26, 2014

Comment 11:

Interesting. Crashes reliably on loaded Solaris with that many CPUs. Could be port's
fault, but then again the crash in
http://build.golang.org/log/370a1f515e9904f35f8b975c85253b0ab9472efa is really similar;
stack_top.b is nil.

@4ad
Copy link
Member Author

4ad commented Mar 28, 2014

Comment 12:

Do you have any suggestions of what instrumentation could be added to further debug
this? I know nothing about the innards of the garbage collector.

@4ad
Copy link
Member Author

4ad commented Mar 28, 2014

Comment 13:

Ok, more info.
I re-tested on a different 24-way machine. This machine was idle,
had 80GB RAM. I got no failures, everything works smoothly. It fails
only on that other machine.
The difference between the machines is that the failing one has
various soft quotas regarding cpu/memory/io, perhaps most importantly,
it has a really low quota on memory utilisation.

@rsc
Copy link
Contributor

rsc commented Mar 28, 2014

Comment 14:

If stack_top.b is zero, then you pulled a zero out of the work buffers (see "Fetch b
from the work buffer"). Why is that happening? There aren't supposed to be nils in the
work buffer. In fact everything in the work buffer is supposed to be in the range
[arena_start, arena_end). Are those set correctly? Add checks to the code that inserts
pointers into the work buffer to make sure nil is not being inserted. If it is, why? If
it is not, why do we find a nil when we read back from the buffer? Is the memory being
zeroed halfway through the collection somehow?

@dvyukov
Copy link
Member

dvyukov commented Mar 31, 2014

Comment 15:

I don't have any suggestions significantly better than what Russ said.
I would start with assuming that we put NULL into workbuf (and that it's not zeroed
later). There is a number of places where we insert objects into workbuf w/o explicit
checks and assuming that it's not NULL. In particular enqueue1 function for roots, e.g.:
        enqueue1(&wbuf, (Obj){p, s->elemsize, 0});
        enqueue1(&wbuf, (Obj){(void*)&spf->fn, PtrSize, 0});
        enqueue1(&wbuf, (Obj){(void*)&spf->fint, PtrSize, 0});
        enqueue1(&wbuf, (Obj){(void*)&spf->ot, PtrSize, 0});
I would add NULL checks into enqueue1/enqueue/flushobjbuf/flushptrbuf functions. If the
check fires, then insert checks earlier, and so on.

@rsc
Copy link
Contributor

rsc commented Apr 3, 2014

Comment 16:

Until we have evidence otherwise, I'm assuming this is a Solaris-specific bug, so this
will not block the release.

Labels changed: added release-go1.3maybe, removed release-go1.3.

@rsc
Copy link
Contributor

rsc commented May 9, 2014

Comment 17:

Labels changed: added release-none, removed release-go1.3maybe.

@gopherbot
Copy link
Contributor

Comment 18 by [email protected]:

The issue can be mitigated by creating a processor set and bulding/running go on it:
psrset -c
[assuming you don't have any, a processor set 1 will be created]
psrset -a 1 0
psrset -a 1 1
psrset -e 1 <some go program>

@4ad
Copy link
Member Author

4ad commented Jul 9, 2014

Comment 19:

Note that this failure only occurs in the runtime tests and we have never seen it occur
in practice. Also, if you only want to build Go without running the tests, you can run
make.bash instead of all.bash.
Thanks for the tip regarding psrset.

Owner changed to @4ad.

@davecheney
Copy link
Contributor

Comment 20:

zheganin, I've seen similar symptoms on a joyent smartos machine which presents as
having something like 24 cores, but only a few gb of ram (and presumably little
overcommit). I've always assumed the failure was related not to a concurrency issue, but
a memory allocation failure (due to quotaring) when possibly 24 go test processes start
concurrently.

@4ad
Copy link
Member Author

4ad commented Sep 7, 2014

Comment 21:

Rewriting the runtime in Go fixed this.

Status changed to Fixed.

@4ad 4ad added fixed labels Sep 7, 2014
@golang golang locked and limited conversation to collaborators Jun 25, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants