Description
Hi!
What version of Go are you using (go version)?

```
$ go version
go version go1.13.1 linux/amd64
```

(seen from at least go1.10.4)
Does this issue reproduce with the latest release?
Currently testing with 1.14.3
What operating system and processor architecture are you using (go env)?

go env Output

```
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xfennec/.cache/go-build"
GOENV="/home/xfennec/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xfennec/Quiris/Go"
GOPRIVATE=""
GOPROXY="direct"
GOROOT="/usr/lib/golang"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build988333864=/tmp/go-build -gno-record-gcc-switches"
```
What did you do?
We use a chain of two HTTP proxies written in Go to route incoming traffic to our virtual machines:
```
         incoming traffic
                 |
                 v
          +-------------+
          |   primary   |
          +--+-------+--+
             |       |
        +----+       +----+   HTTP/2
        |                 |
        v                 v
+---------------+ +---------------+
| secondary (1) | | secondary (2) |
+---+---+---+---+ +---+---+---+---+
    |   |   |         |   |   |
    v   v   v         v   v   v
           VMs, HTTP/1.1
```
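For reference, here is a minimal sketch of this setup (not the actual mulch-proxy code; hostname, port and certificate paths are placeholders). The primary is an HTTPS server that forwards requests with httputil.ReverseProxy; since the secondary negotiates "h2" over TLS, the default transport multiplexes every request to a given secondary onto a single HTTP/2 connection:

```go
// Minimal sketch of the primary proxy (placeholders, not the real mulch-proxy code).
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Address of one secondary proxy. It serves HTTPS with ALPN "h2",
	// so the default transport uses a single HTTP/2 connection to it.
	secondary, err := url.Parse("https://secondary-1.example.com")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(secondary)

	// The primary itself serves HTTPS (HTTP/2 enabled by default).
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", proxy))
}
```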
What did you expect to see?
Stable service over time.
What did you see instead?
Every ~10 days, we detect a total stall of requests between our front-end ("primary") proxy and one of the secondary proxies. New requests are accepted, sent to the secondary, then to the VM, but the response is stalled somewhere between the secondary and the primary proxy.
All requests to this secondary proxy then wait forever, since they are all multiplexed inside the same HTTP/2 connection.
Restarting one of the proxies, or even killing the "faulty" HTTP/2 connection between our proxies (using ss -K), makes the service work perfectly again.
We've had this issue for months now, and I still fail to see any common pattern (time of day, server load, requests, …) that would be the trigger. I've tried to write a few test/repro programs, without success either.
Here's a sample of the stack on the primary proxy during one of the outages:
```
goroutine 72163481 [sync.Cond.Wait, 5 minutes]:
runtime.goparkunlock(...)
	/usr/lib/golang/src/runtime/proc.go:310
sync.runtime_notifyListWait(0xc001b55300, 0x0)
	/usr/lib/golang/src/runtime/sema.go:510 +0xf8
sync.(*Cond).Wait(0xc001b552f0)
	/usr/lib/golang/src/sync/cond.go:56 +0x9d
net/http.(*http2pipe).Read(0xc001b552e8, 0xc0004e0000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/lib/golang/src/net/http/h2_bundle.go:3512 +0xa6
net/http.http2transportResponseBody.Read(0xc001b552c0, 0xc0004e0000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/lib/golang/src/net/http/h2_bundle.go:8448 +0xac
net/http/httputil.(*ReverseProxy).copyBuffer(0xc00014f810, 0x7f4cf39cc878, 0xc0000100a0, 0x7f4cf3a120f8, 0xc001b552c0, 0xc0004e0000, 0x8000, 0x8000, 0xc0006876d0, 0x920ca887, ...)
	/usr/lib/golang/src/net/http/httputil/reverseproxy.go:408 +0xa7
net/http/httputil.(*ReverseProxy).copyResponse(0xc00014f810, 0x7f4cf39cc878, 0xc0000100a0, 0x7f4cf3a120f8, 0xc001b552c0, 0x0, 0x0, 0x0)
	/usr/lib/golang/src/net/http/httputil/reverseproxy.go:396 +0xb5
net/http/httputil.(*ReverseProxy).ServeHTTP(0xc00014f810, 0x859e00, 0xc0000100a0, 0xc00111fc00)
	/usr/lib/golang/src/net/http/httputil/reverseproxy.go:289 +0x7e2
main.(*ProxyServer).serveReverseProxy(0xc00006f1a0, 0xc00148e680, 0x7d6772, 0x5, 0x859e00, 0xc0000100a0, 0xc00111fc00, 0xc001f07100)
	/home/xfennec/Quiris/Go/src/github.com/OnitiFR/mulch/cmd/mulch-proxy/proxy.go:208 +0x277
main.(*ProxyServer).handleRequest(0xc00006f1a0, 0x859e00, 0xc0000100a0, 0xc00111fc00)
	/home/xfennec/Quiris/Go/src/github.com/OnitiFR/mulch/cmd/mulch-proxy/proxy.go:287 +0x608
net/http.HandlerFunc.ServeHTTP(0xc0000d98b0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
	/usr/lib/golang/src/net/http/server.go:2007 +0x44
net/http.(*ServeMux).ServeHTTP(0xc0000124c0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
	/usr/lib/golang/src/net/http/server.go:2387 +0x1bd
net/http.serverHandler.ServeHTTP(0xc0001460e0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
	/usr/lib/golang/src/net/http/server.go:2802 +0xa4
net/http.initNPNRequest.ServeHTTP(0x85a8c0, 0xc0018b72f0, 0xc000a38380, 0xc0001460e0, 0x859e00, 0xc0000100a0, 0xc00111ac00)
	/usr/lib/golang/src/net/http/server.go:3366 +0x8d
net/http.(*http2serverConn).runHandler(0xc0008b9500, 0xc0000100a0, 0xc00111ac00, 0xc0003f52c0)
	/usr/lib/golang/src/net/http/h2_bundle.go:5713 +0x9f
created by net/http.(*http2serverConn).processHeaders
	/usr/lib/golang/src/net/http/h2_bundle.go:5447 +0x4eb
```
At the same time, on the secondary proxy, I see this:
```
goroutine 32086483 [select, 5 minutes]:
net/http.(*http2serverConn).writeDataFromHandler(0xc0024f7e00, 0xc0002b8d10, 0xc0007b2000, 0x53, 0x1000, 0xc000d4fc01, 0x1d, 0x6779ed)
	/usr/lib/golang/src/net/http/h2_bundle.go:4577 +0x23f
net/http.(*http2responseWriterState).writeChunk(0xc001188380, 0xc0007b2000, 0x53, 0x1000, 0xc00170fdb8, 0x1, 0xc000bf0340)
	/usr/lib/golang/src/net/http/h2_bundle.go:6043 +0x475
net/http.http2chunkWriter.Write(0xc001188380, 0xc0007b2000, 0x53, 0x1000, 0xc00144c420, 0x855e20, 0xc0003e0c80)
	/usr/lib/golang/src/net/http/h2_bundle.go:5926 +0x49
bufio.(*Writer).Flush(0xc000bf1340, 0x0, 0x0)
	/usr/lib/golang/src/bufio/bufio.go:593 +0x75
net/http.(*http2responseWriter).Flush(0xc001582088)
	/usr/lib/golang/src/net/http/h2_bundle.go:6122 +0x42
net/http.(*http2responseWriter).handlerDone(0xc001582088)
	/usr/lib/golang/src/net/http/h2_bundle.go:6258 +0x46
net/http.(*http2serverConn).runHandler.func1(0xc001582088, 0xc00170ff67, 0xc0024f7e00)
	/usr/lib/golang/src/net/http/h2_bundle.go:5711 +0x2bf
net/http.(*http2serverConn).runHandler(0xc0024f7e00, 0xc001582088, 0xc0020c2a00, 0xc0003509e0)
	/usr/lib/golang/src/net/http/h2_bundle.go:5715 +0xa5
created by net/http.(*http2serverConn).processHeaders
	/usr/lib/golang/src/net/http/h2_bundle.go:5447 +0x4eb
```
- how can I make this bug report more helpful?
- any idea for sample code to replicate this issue without waiting 10 days each time?
- as a workaround, what timeout/deadline setting at the Transport level would mitigate this issue? (see the sketch below)
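To illustrate that last question, here is the kind of Transport-level mitigation I have in mind on the primary side. This is only a sketch with guessed values, and it assumes a recent golang.org/x/net/http2 (the ReadIdleTimeout/PingTimeout fields don't seem to exist in the h2 bundle of the Go versions listed above): the transport sends a PING frame when no frame has been received for a while, and closes the connection if the secondary doesn't answer, so later requests open a fresh connection instead of hanging forever.

```go
// Sketch of a possible mitigation on the primary proxy: an explicit
// x/net/http2 transport with PING-based health checks (guessed values,
// placeholder hostname and certificate paths).
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	secondary, err := url.Parse("https://secondary-1.example.com")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(secondary)

	proxy.Transport = &http2.Transport{
		// No frame received for 30s -> send a PING as a health check.
		ReadIdleTimeout: 30 * time.Second,
		// PING unanswered for 10s -> close the connection, so new
		// requests get a fresh one instead of waiting forever.
		PingTimeout: 10 * time.Second,
	}

	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", proxy))
}
```

As far as I understand, a per-request deadline (request Context or ResponseHeaderTimeout) would only abort individual requests and would not tear down the stalled connection itself, which is why I'm looking at connection-level health checks rather than request-level timeouts.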