runtime: Kubernetes' kube-proxy stuck in GC forever #18925
Accidentally posted the issue before I completed editing, will update shortly with more details, sorry |
What I could get out of GDB (maybe I should try a newer version?):
|
Most of kube-proxy source is here: https://github.com/kubernetes/kubernetes/tree/v1.5.2/pkg/proxy |
Per-process CPU usage (just pasting this from htop, see 9th column) -- first line is for the whole process, following lines correspond to threads:
|
Can you run it with the GOTRACEBACK environment variable set? That might be more useful to see Go's own output, rather than gdb's attempt. |
Thanks for the very detailed report. It looks like we're in a stack-scanning deadlock. Narrowing down to just the threads that are in scang (which happen to be exactly the threads that are high in top):
It would be really useful if I could see the contents of *gp for each of these threads. Could you get that from delve or gdb? |
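A minimal gdb Python sketch of one way to dump those fields, assuming a Go 1.7 binary with DWARF info and the Go 1.7 names of the runtime.g fields (goid, atomicstatus, waitreason, sched); the goroutine IDs below are just the example set that comes up later in this thread:

    # Paste into gdb's Python interpreter after attaching to the hung process.
    # Field names assume Go 1.7's runtime.g; adjust the goroutine IDs as needed.
    import gdb

    def dump_g_fields(goids):
        allgs = gdb.parse_and_eval("'runtime.allgs'")  # Go slice header: array/len/cap
        for i in range(int(allgs['len'])):
            gp = allgs['array'][i]                     # *runtime.g
            if int(gp['goid']) not in goids:
                continue
            print("goid=%d atomicstatus=%#x waitreason=%s sched.pc=%#x sched.sp=%#x" % (
                int(gp['goid']), int(gp['atomicstatus']), str(gp['waitreason']),
                int(gp['sched']['pc']), int(gp['sched']['sp'])))

    dump_g_fields({21, 22, 23, 26, 27, 28})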
As for GOTRACEBACK, ok, we will do this and wait for more occurrences. I just thought that I could do something with the hung process in the meantime. Some observations follow. I'm not an expert in the Go runtime, so I may be wrong. Most threads are in
6 remaining threads are in
Closer inspection reveals that
but delve can't show the stacks for them, complaining about null pointers. Will try to grab gp details now. |
Once you get the gp details, it would be really helpful to know their stack traces. I'm not sure why delve can't get them, but if you grab my gdbinit.py from https://github.com/aclements/my-dotfiles/blob/master/gdbinit.py, you can probably get their stacks by running |
gp details: https://gist.github.com/ivan4th/d031a21b5d1043d04f4b74108b4474ff |
Active scans:
All target Gs have waitreason "GC worker (idle)" and their start PCs are all the same, so these are GC workers. It's worth noting that we're in mark termination, so these really shouldn't be in status running; STW should have moved them to runnable. We're stuck because stack scan needs to get these out of running, but we can't preempt anything during STW (because it shouldn't be running at all!), so we're just spinning around. So, it actually could be useful to see the output of |
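For anyone reproducing this analysis, one way to double-check which goroutines the runtime still believes are running (separate from whatever output is being requested above) is to walk runtime.allgs and filter on status. A hedged sketch, assuming the Go 1.7 status constants (_Grunning == 2, with the _Gscan bit 0x1000 or'd in while a scan is in progress):

    # List goroutines whose atomicstatus is _Grunning (or _Gscanrunning), i.e.
    # the ones that should not exist during STW mark termination.
    # Constant values assume Go 1.7: _Grunning == 2, _Gscan bit == 0x1000.
    import gdb

    GRUNNING = 2
    GSCAN = 0x1000

    allgs = gdb.parse_and_eval("'runtime.allgs'")
    for i in range(int(allgs['len'])):
        gp = allgs['array'][i]
        status = int(gp['atomicstatus'])
        if (status & ~GSCAN) == GRUNNING:
            print("goid=%d status=%#x startpc=%#x" % (
                int(gp['goid']), status, int(gp['startpc'])))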
|
It may be related to the environment, got some errors when attaching to the process:
|
(yes, gdb just can't work with threads in this container... will try to fix it. meanwhile see |
Based on
I wouldn't totally trust the M and P numbers here, since these goroutines clearly aren't actually running, but these probably show the last M that was running each goroutine and whatever P that M was on when it was stopped for STW mark termination. These Gs correspond to the following Ms (same order):
And the following Ps:
It's interesting that all of these Ps have assigned mark workers that are not the mark worker that claims it was running on that P. It could just be that the M and P numbers for each G are bogus, or this could mean something. One good thing is that this is almost certainly not an issue in Go 1.8 because Go 1.8 doesn't scan stacks during mark termination. |
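For reference, a rough gdb Python sketch of how those per-P mark worker assignments can be read out; it assumes the Go 1.7 runtime.p field names (id, gcBgMarkWorker) and that only the first gomaxprocs entries of runtime.allp are in use:

    # Print each P's id and the address of the g it has claimed as its
    # background mark worker (gcBgMarkWorker is a guintptr in Go 1.7), so the
    # values can be compared against the *g addresses dumped earlier.
    import gdb

    allp = gdb.parse_and_eval("'runtime.allp'")
    nproc = int(gdb.parse_and_eval("'runtime.gomaxprocs'"))
    for i in range(nproc):
        p = allp[i]
        print("P %d gcBgMarkWorker=%#x" % (int(p['id']), int(p['gcBgMarkWorker'])))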
@ivan4th, another thing to do in gdb on the running image: assuming you've still got my gdbinit.py loaded, enter gdb's Python interpreter and paste the following:

    def btfrom(g):
        # Temporarily point gdb's unwinder at the goroutine's saved registers.
        pc, sp = g["sched"]["pc"], g["sched"]["sp"]
        oldsp, oldpc = gdb.parse_and_eval('$sp'), gdb.parse_and_eval('$pc')
        try:
            # TODO: This fails if we're not in the innermost frame.
            gdb.execute('set $sp = %#x' % sp)
            gdb.execute('set $pc = %#x' % pc)
            gdb.execute('backtrace')
        finally:
            # Restore the real registers.
            gdb.execute('set $sp = %#x' % oldsp)
            gdb.execute('set $pc = %#x' % oldpc)

    for goid in [21, 22, 23, 26, 27, 28]:
        print "Goroutine", goid, "sched stack:"
        btfrom(getg(n=goid))
        print

Hit Ctrl-D to run it and get back to the GDB prompt. This should dump out the stacks of these goroutines from where they actually stopped (ignoring their claim that they are in state "running"). |
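For what it's worth, the reason this approach can work at all is that the runtime saves a descheduled goroutine's pc and sp in g.sched, so temporarily pointing gdb's $pc/$sp at those values lets gdb's normal unwinder walk that goroutine's stack; it only produces sensible output for goroutines whose sched fields were actually populated by the scheduler.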
Unfortunately I got this:
|
Actually the same error happens for each goroutine. |
Yeah, looking back at your dump of the gp's, their saved SPs are all 0. That might actually be a clue. I think the only thing that puts 0 there is returning from |
On further thought, I don't think this indicates anything. Looking at the P's assigned mark workers, they're all exactly in sequence, except for P 20, which would clearly have mark worker G 40 except that mark worker 40 is the one that disassociated and is driving mark termination. These are exactly the 48 mark workers, so nothing funny happened where one disassociated from a P and was replaced by another one. @ivan4th, another potentially useful thing you can get from GDB or delve is the value of |
|
|
We had the same issue with |
@aclements, what's the status here? Is there something you still want to do for Go 1.9? Or do you suspect it's already been fixed? |
@bradfitz, unfortunately I still have no idea what's going on here. But it may have been related to stack rescanning; 1.8 doesn't rescan stacks and 1.9 will remove the rescan mechanism entirely, so it's possible this is fixed as a side effect. @r7vme, what version of Go was your crashing |
@aclements not sure, but |
@r7vme @aclements on phone but #13507 should provide hints for how to find what Go version the binary was built with. (Maybe we should make a go-gettable tool for this.) |
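As a rough stand-in for such a tool: Go binaries embed the version string that runtime.Version() reports, so a crude scan for a go1.x marker usually answers the question. A hedged sketch (Python 3, heuristic only; not necessarily the approach #13507 describes):

    # Heuristically report the Go release a binary was built with by scanning
    # it for the embedded runtime version string (e.g. "go1.7.4").
    import re
    import sys

    def guess_go_versions(path):
        with open(path, 'rb') as f:
            data = f.read()
        return sorted(set(m.decode() for m in re.findall(rb'go1\.\d+(?:\.\d+)?', data)))

    if __name__ == '__main__':
        # Usage: python3 goversion.py /path/to/kube-proxy
        print(guess_go_versions(sys.argv[1]))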
|
Go-gettable tool: https://github.com/loeyt/go-version |
I thought Kubernetes is still on Go 1.7.6, based on the release notes for k8s version 1.6.6 that I read last week.
|
@josharian @aclements sorry, I messed stuff up. Yes, we use kube-proxy from hyperkube v1.5.2_coreos.0
|
Based on kubernetes/kubernetes#38228, Kubernetes 1.7 should be based on Go 1.8, which I believe does not have this bug. Kubernetes 1.7 should be out next week (June 28th). Has anybody stress tested or run into this problem with the Kubernetes 1.7 beta? |
Closing this, assuming it's fixed in Kubernetes 1.7 (Go 1.8). Please comment if not and we'll re-open. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go1.7.4 linux/amd64
What operating system and processor architecture are you using (go env)?
(from k8s build container)
What did you do?
Running k8s cluster
What did you expect to see?
kube-proxy process running as expected
What did you see instead?
kube-proxy gets stuck in GC code, with no goroutines being scheduled
The problem is quite hard to reproduce: something similar happens on some nodes from time to time, and it may take several days to run into it. We currently have a process that's in the state described here running on one of the nodes. I'm pretty sure it's related to the Go runtime, but in any case I'm stuck trying to find a way to debug it. Any hints on what needs to be done to find out the cause of the problem would be very much appreciated. I don't want to kill the process with SIGQUIT to retrieve goroutine info, so I'm doing this with delve instead.
The process is running in a Docker container (Debian jessie) with GOMAXPROCS=48.
Goroutine info: https://gist.github.com/ivan4th/17654f6fee35a38548502de4b6f68ce4
Thread info: https://gist.github.com/ivan4th/4596664ba1f935c500250f74ade5c162
Log output from another hung kube-proxy killed with SIGQUIT: https://gist.github.com/ivan4th/5da95ebf8986c6834bca35c9b4e7895b