runtime: Kubernetes' kube-proxy stuck in GC forever #18925
Accidentally posted the issue before I completed editing, will update shortly with more details, sorry |
What I could get out of GDB (maybe I should try a newer version?):
|
Most of kube-proxy source is here: https://github.com/kubernetes/kubernetes/tree/v1.5.2/pkg/proxy |
Per-process CPU usage (just pasting this from htop, see 9th column) -- first line is for the whole process, following lines correspond to threads:
|
Can you run it with the GOTRACEBACK environment variable set? That might be more useful to see Go's own output, rather than gdb's attempt. |
Thanks for the very detailed report. It looks like we're in a stack-scanning deadlock. Narrowing down to just the threads that are in scang (which happen to be exactly the threads that are high in top):
It would be really useful if I could see the contents of *gp for each of these threads. Could you get that from delve or gdb? |
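A minimal gdb Python sketch of one way to dump those fields, assuming a Go 1.7 binary with DWARF info and the Go 1.7 names of the runtime.g fields (goid, atomicstatus, waitreason, sched); the goroutine IDs below are just the example set that comes up later in this thread:

    # Paste into gdb's Python interpreter after attaching to the hung process.
    # Field names assume Go 1.7's runtime.g; adjust the goroutine IDs as needed.
    import gdb

    def dump_g_fields(goids):
        allgs = gdb.parse_and_eval("'runtime.allgs'")  # Go slice header: array/len/cap
        for i in range(int(allgs['len'])):
            gp = allgs['array'][i]                     # *runtime.g
            if int(gp['goid']) not in goids:
                continue
            print("goid=%d atomicstatus=%#x waitreason=%s sched.pc=%#x sched.sp=%#x" % (
                int(gp['goid']), int(gp['atomicstatus']), str(gp['waitreason']),
                int(gp['sched']['pc']), int(gp['sched']['sp'])))

    dump_g_fields({21, 22, 23, 26, 27, 28})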
As for GOTRACEBACK, ok, we will do this and wait for more occurrences. I just thought that I could do something with the hung process in the meantime. Some observations follow. I'm not an expert in the Go runtime, so I may be wrong. Most threads are in
6 remaining threads are in
Closer inspection reveals that
but delve can't show the stacks for them, complaining about null pointers. Will try to grab gp details now. |
Once you get the gp details, it would be really helpful to know their stack traces. I'm not sure why delve can't get them, but if you grab my gdbinit.py from https://github.com/aclements/my-dotfiles/blob/master/gdbinit.py, you can probably get their stacks by running |
gp details: https://gist.github.com/ivan4th/d031a21b5d1043d04f4b74108b4474ff |
Active scans:
All target Gs have waitreason "GC worker (idle)" and their start PCs are all the same, so these are GC workers. It's worth noting that we're in mark termination, so these really shouldn't be in status running; STW should have moved them to runnable. We're stuck because stack scan needs to get these out of running, but we can't preempt anything during STW (because it shouldn't be running at all!), so we're just spinning around. So, it actually could be useful to see the output of |
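For anyone reproducing this analysis, one way to double-check which goroutines the runtime still believes are running (separate from whatever output is being requested above) is to walk runtime.allgs and filter on status. A hedged sketch, assuming the Go 1.7 status constants (_Grunning == 2, with the _Gscan bit 0x1000 or'd in while a scan is in progress):

    # List goroutines whose atomicstatus is _Grunning (or _Gscanrunning), i.e.
    # the ones that should not exist during STW mark termination.
    # Constant values assume Go 1.7: _Grunning == 2, _Gscan bit == 0x1000.
    import gdb

    GRUNNING = 2
    GSCAN = 0x1000

    allgs = gdb.parse_and_eval("'runtime.allgs'")
    for i in range(int(allgs['len'])):
        gp = allgs['array'][i]
        status = int(gp['atomicstatus'])
        if (status & ~GSCAN) == GRUNNING:
            print("goid=%d status=%#x startpc=%#x" % (
                int(gp['goid']), status, int(gp['startpc'])))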
|
It may be related to the environment, got some errors when attaching to the process:
|
(yes, gdb just can't work with threads in this container... will try to fix it. meanwhile see |
Based on
I wouldn't totally trust the M and P numbers here, since these goroutines clearly aren't actually running, but these probably show the last M that was running each goroutine and whatever P that M was on when it was stopped for STW mark termination. These Gs correspond to the following Ms (same order):
And the following Ps:
It's interesting that all of these Ps have assigned mark workers that are not the mark worker that claims it was running on that P. It could just be that the M and P numbers for each G are bogus, or this could mean something. One good thing is that this is almost certainly not an issue in Go 1.8 because Go 1.8 doesn't scan stacks during mark termination. |
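For reference, a rough gdb Python sketch of how those per-P mark worker assignments can be read out; it assumes the Go 1.7 runtime.p field names (id, gcBgMarkWorker) and that only the first gomaxprocs entries of runtime.allp are in use:

    # Print each P's id and the address of the g it has claimed as its
    # background mark worker (gcBgMarkWorker is a guintptr in Go 1.7), so the
    # values can be compared against the *g addresses dumped earlier.
    import gdb

    allp = gdb.parse_and_eval("'runtime.allp'")
    nproc = int(gdb.parse_and_eval("'runtime.gomaxprocs'"))
    for i in range(nproc):
        p = allp[i]
        print("P %d gcBgMarkWorker=%#x" % (int(p['id']), int(p['gcBgMarkWorker'])))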
@ivan4th, another thing to do in gdb on the running image: assuming you've still got my gdbinit.py loaded, enter gdb's Python interpreter and paste the following:

    def btfrom(g):
        # Temporarily point gdb's unwinder at the goroutine's saved registers.
        pc, sp = g["sched"]["pc"], g["sched"]["sp"]
        oldsp, oldpc = gdb.parse_and_eval('$sp'), gdb.parse_and_eval('$pc')
        try:
            # TODO: This fails if we're not in the innermost frame.
            gdb.execute('set $sp = %#x' % sp)
            gdb.execute('set $pc = %#x' % pc)
            gdb.execute('backtrace')
        finally:
            # Restore the real registers.
            gdb.execute('set $sp = %#x' % oldsp)
            gdb.execute('set $pc = %#x' % oldpc)

    for goid in [21, 22, 23, 26, 27, 28]:
        print "Goroutine", goid, "sched stack:"
        btfrom(getg(n=goid))
        print

Hit Ctrl-D to run it and get back to the GDB prompt. This should dump out the stacks of these goroutines from where they actually stopped (ignoring their claim that they are in state "running"). |
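For what it's worth, the reason this approach can work at all is that the runtime saves a descheduled goroutine's pc and sp in g.sched, so temporarily pointing gdb's $pc/$sp at those values lets gdb's normal unwinder walk that goroutine's stack; it only produces sensible output for goroutines whose sched fields were actually populated by the scheduler.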
Unfortunately I got this:
|
Actually the same error happens for each goroutine. |
Yeah, looking back at your dump of the gp's, their saved SPs are all 0. That might actually be a clue. I think the only thing that puts 0 there is returning from |
On further thought, I don't think this indicates anything. Looking at the P's assigned mark workers, they're all exactly in sequence, except for P 20, which would clearly have mark worker G 40 except that mark worker 40 is the one that disassociated and is driving mark termination. These are exactly the 48 mark workers, so nothing funny happened where one disassociated from a P and was replaced by another one. @ivan4th, another potentially useful thing you can get from GDB or delve is the value of |
|
|
We had the same issue with |
@aclements, what's the status here? Is there something you still want to do for Go 1.9? Or do you suspect it's already been fixed? |
@bradfitz, unfortunately I still have no idea what's going on here. But it may have been related to stack rescanning; 1.8 doesn't rescan stacks and 1.9 will remove the rescan mechanism entirely, so it's possible this is fixed as a side effect. @r7vme, what version of Go was your crashing |
@aclements not sure, but |
@r7vme @aclements on phone but #13507 should provide hints for how to find what Go version the binary was built with. (Maybe we should make a go-gettable tool for this.) |
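As a rough stand-in for such a tool: Go binaries embed the version string that runtime.Version() reports, so a crude scan for a go1.x marker usually answers the question. A hedged sketch (Python 3, heuristic only; not necessarily the approach #13507 describes):

    # Heuristically report the Go release a binary was built with by scanning
    # it for the embedded runtime version string (e.g. "go1.7.4").
    import re
    import sys

    def guess_go_versions(path):
        with open(path, 'rb') as f:
            data = f.read()
        return sorted(set(m.decode() for m in re.findall(rb'go1\.\d+(?:\.\d+)?', data)))

    if __name__ == '__main__':
        # Usage: python3 goversion.py /path/to/kube-proxy
        print(guess_go_versions(sys.argv[1]))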
|
Go-gettable tool: https://github.com/loeyt/go-version |
I thought Kubernetes is still on Go 1.7.6, based on the release notes for k8s version 1.6.6 that I read last week.
|
@josharian @aclements sorry, I messed stuff up. Yes, we use kube-proxy from hyperkube v1.5.2_coreos.0
|
Based on kubernetes/kubernetes#38228, Kubernetes 1.7 should be based on Go 1.8, which I believe does not have this bug. Kubernetes 1.7 should be out next week (June 28th). Has anybody stress tested or run into this problem with the Kubernetes 1.7 beta? |
Closing this, assuming it's fixed in Kubernetes 1.7 (Go 1.8). Please comment if not and we'll re-open. |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go1.7.4 linux/amd64
What operating system and processor architecture are you using (go env)?
(from k8s build container)
What did you do?
Running k8s cluster
What did you expect to see?
kube-proxy process running as expected
What did you see instead?
kube-proxy gets stuck in GC code, with no goroutines being scheduled
The problem is quite hard to reproduce: something similar happens on some nodes from time to time, and it may take several days to run into it. We currently have a process that's in the state described here running on one of the nodes. I'm pretty sure it's related to the Go runtime, but in any case I'm stuck trying to find a way to debug it. Any hints on what needs to be done to find out the cause of the problem would be very much appreciated. I don't want to kill the process with SIGQUIT to retrieve goroutine info, so I'm doing this with delve instead.
The process is running in a Docker container (Debian jessie) with GOMAXPROCS=48.
Goroutine info: https://gist.github.com/ivan4th/17654f6fee35a38548502de4b6f68ce4
Thread info: https://gist.github.com/ivan4th/4596664ba1f935c500250f74ade5c162
Log output from another hung kube-proxy killed with SIGQUIT: https://gist.github.com/ivan4th/5da95ebf8986c6834bca35c9b4e7895b