net/http: TestServerAllowsBlockingRemoteAddr flake on linux-amd64-race #19161

mdempsky · 2017-02-17T19:29:21Z

While trybotting https://go-review.googlesource.com/#/c/37149/

https://storage.googleapis.com/go-build-log/aa3dd3e6/linux-amd64-race_252fe50a.log

--- FAIL: TestServerAllowsBlockingRemoteAddr (1.68s)
	serve_test.go:959: Get http://127.0.0.1:41257: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
	serve_test.go:1008: response 1 addr = ""; want "RA:21.21.21.21:21"
1 second passes in backend, proxygone= false
FAIL
FAIL	net/http	13.395s

The text was updated successfully, but these errors were encountered:

bradfitz · 2017-03-08T22:31:36Z

And linux-ppc64le:

https://build.golang.org/log/f580613a303c908044253788826180d2c59337c1

--- FAIL: TestServerAllowsBlockingRemoteAddr (0.00s)
	serve_test.go:1000: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"
FAIL
FAIL	net/http	6.647s

And I used to see it on Solaris a lot before we quadrupled their CPU from 0.5 to 2 vCPUs.

bradfitz · 2017-03-08T22:49:09Z

I spent some time looking at this test. It's not obviously doing anything racy or wrong.

Any eyeballs welcome.

josharian · 2017-03-10T01:19:14Z

This is showing up a lot. Mark as flaky for the time being?

bradfitz · 2017-03-10T01:28:13Z

Sure, we could do that. Feel free to send a CL? But while doing so, can you try to spot the flakiness? I apaprently failed.

josharian · 2017-03-10T06:00:11Z

I think that the test just gets starved. A few observations:

If you add

		time.Sleep(500 * time.Millisecond)

to the HandlerFunc near the beginning of the function, I reliably get a test failure:

--- FAIL: TestServerAllowsBlockingRemoteAddr (1.00s)
	serve_test.go:952: Request 1: Get http://127.0.0.1:60996: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
	serve_test.go:1001: response 1 addr = ""; want "RA:21.21.21.21:21"
FAIL

As further evidence of starvation, greplogs for this test going back several months yields this distribution of builders:

 90.08%   463 solaris-amd64-smartosbuildlet
  3.50%    18 plan9-arm
  2.33%    12 plan9-amd64-9front
  0.78%     4 netbsd-amd64-70
  0.58%     3 freebsd-arm-paulzhol
  0.58%     3 linux-ppc64-buildlet
  0.39%     2 linux-386-clang
  0.39%     2 linux-amd64-clang
  0.39%     2 linux-amd64-race
  0.39%     2 linux-ppc64le-buildlet
  0.39%     2 netbsd-386-minux
  0.19%     1 linux-amd64-noopt

and we know solaris was resource-starved until recently.

This test is one of many marked as parallel in net/http, which could be leading to the starvation.

We could try increasing the http client's timeout, but given that we're already blowing through a full second, I think a better plan is to not have this test be parallel. I'll send a CL to that effect.

gopherbot · 2017-03-10T06:05:35Z

CL https://golang.org/cl/38005 mentions this issue.

josharian · 2017-03-10T18:48:11Z

Re-opening and moving to unplanned, to investigate whether the test can be made robust while still marked parallel. See the discussion in CL 38005.

mundaym · 2017-03-13T19:55:32Z

Is it worth making a new issue (or repurposing this one fully) for the response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21" failure mode? We've been seeing it a lot recently even after the parallelism was removed:

linux/amd64-clang: https://build.golang.org/log/dde822640469f46493857c984fa4a2dca551817f
linux/ppc64le: https://build.golang.org/log/f80618f409ad8e07c6acb503e66d1445644c7700
linux/ppc64le: https://build.golang.org/log/724f466c6e7f7f2f3a67dea9379cdb38a6bbb1e4

I took a quick glance at the test and it looks like this error means that the same message gets sent on both channels. It seems strange to me that resource starvation would be the root cause of such an issue, so maybe it is unrelated to the timeout problem?

bradfitz · 2017-03-13T20:00:17Z

We can reuse this issue.

josharian · 2017-03-13T22:34:43Z

Here are some recent failures, according to greplogs/fetchlogs:

2017-03-08T18:27:19-6fbedc1/linux-amd64-noopt
2017-03-08T18:52:12-d71f36b/linux-ppc64-buildlet
2017-03-08T18:58:30-29edf0f/linux-ppc64-buildlet
2017-03-08T22:05:52-23be728/linux-ppc64le-buildlet
2017-03-09T16:29:40-b2c391b/linux-amd64-clang
2017-03-09T16:42:46-3b68922/linux-amd64-clang
2017-03-09T18:36:49-e57350f/linux-ppc64le-buildlet
2017-03-09T21:34:14-e471ad9/linux-386-clang
2017-03-09T23:28:48-1f9f0ea/linux-386-clang
2017-03-09T23:39:01-c63ad97/linux-ppc64-buildlet
2017-03-11T00:39:23-2de773d/linux-amd64-clang
2017-03-11T01:19:13-fe34585/linux-ppc64le-buildlet
2017-03-13T04:00:12-dd0e1ac/linux-ppc64le-buildlet
2017-03-13T19:39:15-a51e4cc/netbsd-amd64-70
2017-03-13T20:22:39-2e7c3b3/linux-ppc64le-buildlet

In the hopes of bisecting, I just tried reproducing in a loop on a linux/amd64 gomote with tip and

gomote -user josharian run user-josharian-linux-amd64-0 go/bin/go test -count=1000000 -run=TestServerAllowsBlockingRemoteAddr -timeout=1h net/http

and got no failures. (I tried a similar thing locally on darwin/amd64 and also got no failures.) The linux-amd64-clang gomote won't boot at all.

I'm giving up for now, just leaving breadcrumbs for others. If anyone wants to stress test runs of this test or all net/http tests on some linux hardware and see if you can reliably reproduce on demand, that'd be a first step.

bradfitz · 2017-03-13T22:38:06Z

The linux-amd64-clang gomote won't boot at all.

Did you wait a full 40-60 seconds? It's a VM still, IIRC, and not a dedicated machine or container. (It could be a container, but I don't think it's been migrated)

josharian · 2017-03-13T22:51:43Z

gomote waited 180s and timed out.

bradfitz · 2017-03-13T22:58:38Z

Maybe there was a burst of commits and we were maxed out on GCE quota. Prioritizing some types of buildlet users over others is #19178.

siebenmann · 2017-03-18T01:15:08Z

I'm sporadically seeing the response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21" test failures on my Linux machines. This is with physical hardware, not virtualized, and just running as a normal user on generally unloaded workstations.

GOARCH="amd64"
GOOS="linux"

bradfitz · 2017-03-18T01:17:19Z

@siebenmann, yeah, that one's been on the dashboard recently too.

I need to understand that test more. It was added in b58515b by @glasser. David, could you also take a look and figure out why it's flaky?

glasser · 2017-03-18T06:00:40Z

Hmm. I agree that if a machine is overloaded then the timeout could be too small, but people are seeing the wrong RemoteAddr being returned (the final check of the test) without any other errors? Did this start recently? The test in question is 1.5 years old. @siebenmann are you just seeing this in the context of bigger test runs or are you able to reproduce it on its own?

siebenmann · 2017-03-18T06:17:56Z

I've only seen this when building Go through ./all.bash, but I haven't tried to run this test outside of the build process. As far as I remember I've only seen this on its own (without other errors) and only occasionally, and I think it started happening only relatively recently (although I can't say how recently because I've ignored it before now).

(If the go build status is good in general, I've tended to assume that any test or build failures that I run into are specific to my machines, especially if they're sporadic.)

glasser · 2017-03-18T15:34:39Z

Could there be a recently introduced bug in DisableKeepAlives? The main way I could imagine that particular failure is if both client.Get calls used the same connection.

If that's the case one could de-flake by using separate Transports and Clients or perhaps by setting Connection: close on one side or the other, but there's probably an actual bug to fix...

glasser · 2017-03-18T15:35:58Z

Although hmm, that theory doesn't make much sense because we do verify that Accept happens twice.

glasser · 2017-03-18T15:42:04Z

Oh. It looks like 3b988eb lost the use of a Transport with DisableKeepAlives? I'm on my phone now but it doesn't look at a quick glance like ts.Client uses a non default Transport.

glasser · 2017-03-18T15:48:14Z

And if I'm correct, looks like a whole bunch of other tests lost DisableKeepAlives in that commit. I suppose some of them may have merely had the flag set to avoid overlapping with other tests on the same Transport but who knows how many other tests actually relied on it internally. I probably should have put a more explicit comment on that line.

glasser · 2017-03-18T15:52:03Z

Note that the commit in question merged after this issue was opened but before all the comments on it.

glasser · 2017-03-18T15:54:28Z

I'm still not clear on how the second Accept succeeded in this case. And the connection surely isn't "idle". But losing the flag is definitely wrong. Will stop updating a zillion times from my phone now :)

glasser · 2017-03-18T16:01:01Z

I actually don't have a great understanding of when http.Transport will send two requests in parallel on the same connection. (and is http/2 perhaps involved? Some of the buildlogs above have errors that mention http/2.)

siebenmann · 2017-03-19T05:15:01Z

I just had a build failure that also had a http2 test failure:

--- FAIL: TestServerAllowsBlockingRemoteAddr (0.00s)
        serve_test.go:1018: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"
2017/03/19 01:09:21 http2: server: error reading preface from client 127.0.0.1:50232: read tcp 127.0.0.1:35221->127.0.0.1:50232: use of closed file or network connection
FAIL
FAIL    net/http        3.170s

As usual, redoing the build had everything go fine.

bradfitz · 2017-03-19T05:19:42Z

@siebenmann, that's log spam from an unrelated test I believe.

eliasnaur · 2017-03-20T14:04:20Z

FWIW, this is happening on darwin/arm64 as well:

https://build.golang.org/log/d6c544a1382073835ea733e4eaf094c243d915a3

gopherbot · 2017-03-20T17:01:05Z

CL https://golang.org/cl/38373 mentions this issue.

glasser · 2017-03-20T17:19:05Z

I'd love to dig in more to see if I can reproduce, but it's been a while since I've dabbled in Go core. Can somebody remind me how to run a specific test without having to run all of the big slow all.bash (the only thing mentioned in contribute.html)? Just normal go test . -run Blocking doesn't work, presumably because I'm running an unrelated version of Go. Are there docs somewhere that I'm missing?

As noted in #19161 (comment), CL 37771 (adding use of the new httptest.Server.Client to all net/http tests) accidentally reverted DisableKeepAlives for this test. For many tests, DisableKeepAlives was just present to prevent goroutines from staying active after the test exited. In this case it might actually be important. (We'll see) Updates #19161 Change-Id: I11f889f86c932b51b11846560b68dbe5993cdfc3 Reviewed-on: https://go-review.googlesource.com/38373 Reviewed-by: Michael Munday <[email protected]>

bradfitz · 2017-03-20T17:35:02Z

@glasser:

$ cd /path/to/devgo
$ cd src
$ ./make.bash
$ cd net/http
$ /path/to/devgo/bin/go test -v -run=Blocking -count=500

Or if you want to run it in race mode (often shakes things out):

$ /path/to/devgo/bin/go test -race -i    # install race deps once
$ /path/to/devgo/bin/go test -v -run=Blocking -count=500 -race

siebenmann · 2017-03-20T18:03:56Z

I can get a test version of Go at commit d80166e to repeatedly fail go test -run=Blocking -count=10000 with the usual serve_test.go:1018: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21" error. With the next commit, eb6c1dd 'net/http: deflake TestServerAllowsBlockingRemoteAddr more', my go test test run no longer produces any failures even if run for longer than 10,000 iterations. Nor can I get it to fail with -race.

bradfitz · 2017-03-20T18:08:42Z

@siebenmann, excellent. Thanks for confirming. I wasn't able to cause the failure so I was just hoping. Glad to hear it's fixed now. Will close for now. Hopefully nobody else hits this again.

glasser · 2017-03-20T18:16:06Z

@siebenmann I'm glad that seems to work! If you're interested in helping me figure out exactly why it was broken, do you think you could run glasser@223b373 and share what it prints on failure cases? The commit adds a bunch of t.Logs using httptrace.

siebenmann · 2017-03-20T18:48:35Z

I cloned your Go repo and checked out that commit, and I get:

--- FAIL: TestServerAllowsBlockingRemoteAddr (0.25s)
        serve_test.go:965: Getting conn for 127.0.0.1:46137 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:46137 ...
        serve_test.go:965: Getting conn for 127.0.0.1:46137 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:46137 ...
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:46137 = <nil>
        serve_test.go:965: got conn: {Conn:0xc42000e050 Reused:false WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:965: got conn: {Conn:0xc42000e050 Reused:true WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:1048: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:46137 = <nil>
--- FAIL: TestServerAllowsBlockingRemoteAddr (0.00s)
        serve_test.go:965: Getting conn for 127.0.0.1:33671 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:33671 ...
        serve_test.go:965: Getting conn for 127.0.0.1:33671 ...
        serve_test.go:965: ConnectStart: Connecting to tcp 127.0.0.1:33671 ...
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:33671 = <nil>
        serve_test.go:965: got conn: {Conn:0xc420100028 Reused:false WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:965: got conn: {Conn:0xc420100028 Reused:true WasIdle:false IdleTime:0s}
        serve_test.go:965: WroteRequest: {Err:<nil>}
        serve_test.go:965: ConnectDone: connected to tcp 127.0.0.1:33671 = <nil>
        serve_test.go:965: first response byte
        serve_test.go:965: PutIdleConn = <nil>
        serve_test.go:1048: response 1 addr = "RA:12.12.12.12:12"; want "RA:21.21.21.21:21"

(And more copies of them.)

Running with just go test -run=Blocking -count=500 doesn't seem to reliably trigger the failure for me, for what it's worth, although sometimes it does. I thought initially it might be related to the number of sockets in TIME_WAIT or just active, but I checked netstat -n --inet when I can get go test -run=Blocking -count=500 to fail some of the time and it doesn't seem to be this; I can start from what looks like zero localhost sockets in any state and it will still fail some times.

glasser · 2017-03-20T19:56:51Z

OK, so it clearly shows that the client is reusing the connection (Reused:true) which can certainly cause the observed behavior.

But the strange bit is that first response byte shouldn't happen until we send an address over one of the addrs channels, and that shouldn't happen until after we read conn2 = <-conns (or have a "Second Accept didn't happen" Fatal). And that shouldn't happen until something makes a second TCP connection to the port.

Is it possible that something else on the system is making a TCP connection to the server port at just the wrong time? I'm imagining something like

Server starts listening
Something random connects to the server over TCP; its net.Conn gets sent to conns
Client starts first request; its net.Conn gets sent to conns and its server blocks on RemoteAddr()
Test reads the random connection from conns
Client starts goroutine which will eventually get around to sending second request
Test reads the first request's net.Conn from conns
Test sends 12.12.12.12 over the first request's connection's addrs channel
Server returns response to first request and client sees it
That goroutine for the second request finally gets around to running the request. It sees an idle connection and happily uses it.
Server returns response for the second request, which uses the same 12.12.12.12 RemoteAddr since that's per-request.

This is all believable, if there were any reason to believe that somebody would make a TCP connection to the server port right around the time it started listening. Possibly it's even just the first request connecting twice for some error-retry reason?

bradfitz self-assigned this Mar 8, 2017

bradfitz added the Testing An issue that has been verified to require only test changes, not just a test failure. label Mar 8, 2017

bradfitz added this to the Go1.9 milestone Mar 8, 2017

bradfitz added the help wanted label Mar 8, 2017

gopherbot closed this as completed in d9fe233 Mar 10, 2017

josharian reopened this Mar 10, 2017

josharian modified the milestones: Unplanned, Go1.9 Mar 10, 2017

josharian modified the milestones: Go1.9, Unplanned Mar 13, 2017

bradfitz closed this as completed Mar 20, 2017

golang locked and limited conversation to collaborators Mar 20, 2018

gopherbot added the FrozenDueToAge label Mar 20, 2018

rsc unassigned bradfitz Jun 23, 2022

net/http: TestServerAllowsBlockingRemoteAddr flake on linux-amd64-race #19161

net/http: TestServerAllowsBlockingRemoteAddr flake on linux-amd64-race #19161

Comments

mdempsky commented Feb 17, 2017

bradfitz commented Mar 8, 2017

bradfitz commented Mar 8, 2017

josharian commented Mar 10, 2017

bradfitz commented Mar 10, 2017

josharian commented Mar 10, 2017

gopherbot commented Mar 10, 2017

josharian commented Mar 10, 2017

mundaym commented Mar 13, 2017 • edited Loading

bradfitz commented Mar 13, 2017

josharian commented Mar 13, 2017

bradfitz commented Mar 13, 2017

josharian commented Mar 13, 2017 • edited Loading

bradfitz commented Mar 13, 2017

siebenmann commented Mar 18, 2017

bradfitz commented Mar 18, 2017

glasser commented Mar 18, 2017

siebenmann commented Mar 18, 2017

glasser commented Mar 18, 2017

glasser commented Mar 18, 2017

glasser commented Mar 18, 2017

glasser commented Mar 18, 2017 • edited Loading

glasser commented Mar 18, 2017

glasser commented Mar 18, 2017 • edited Loading

glasser commented Mar 18, 2017

siebenmann commented Mar 19, 2017

bradfitz commented Mar 19, 2017

eliasnaur commented Mar 20, 2017

gopherbot commented Mar 20, 2017

glasser commented Mar 20, 2017

bradfitz commented Mar 20, 2017

siebenmann commented Mar 20, 2017

bradfitz commented Mar 20, 2017

glasser commented Mar 20, 2017

siebenmann commented Mar 20, 2017

glasser commented Mar 20, 2017

mundaym commented Mar 13, 2017 •

edited

Loading

josharian commented Mar 13, 2017 •

edited

Loading

glasser commented Mar 18, 2017 •

edited

Loading

glasser commented Mar 18, 2017 •

edited

Loading