net/http: TestServerAllowsBlockingRemoteAddr flake on linux-amd64-race #19161
And linux-ppc64le: https://build.golang.org/log/f580613a303c908044253788826180d2c59337c1
And I used to see it on Solaris a lot before we quadrupled their CPU from 0.5 to 2 vCPUs.
I spent some time looking at this test. It's not obviously doing anything racy or wrong. Any eyeballs welcome.
This is showing up a lot. Mark as flaky for the time being?
Sure, we could do that. Feel free to send a CL? But while doing so, can you try to spot the flakiness? I apparently failed.
I think that the test just gets starved. A few observations: If you add time.Sleep(500 * time.Millisecond) to the
As further evidence of starvation,
and we know Solaris was resource-starved until recently. This test is one of many marked as parallel in net/http, which could be leading to the starvation. We could try increasing the http client's timeout, but given that we're already blowing through a full second, I think a better plan is to not have this test be parallel. I'll send a CL to that effect.
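For concreteness, here is a minimal sketch of the two options being weighed above. It is not the real test (whose handler and blocking-RemoteAddr plumbing are more involved); the function and handler names are made up, and it just shows the shape of the change: drop t.Parallel(), or keep it and give the client more headroom than the one second that is already being exceeded.

```go
package sketch_test

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

// Sketch only: stands in for TestServerAllowsBlockingRemoteAddr.
func TestRemoteAddrNotStarved(t *testing.T) {
	// Option 1: don't call t.Parallel(), so this test isn't competing with
	// the many other parallel net/http tests on a loaded builder.

	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "RA:%s", r.RemoteAddr) // hypothetical handler echoing the client address
	}))
	defer ts.Close()

	// Option 2: if it stays parallel, give the client far more headroom than
	// the one-second budget that was already being blown through.
	c := &http.Client{Timeout: 5 * time.Second}
	res, err := c.Get(ts.URL)
	if err != nil {
		t.Fatal(err)
	}
	res.Body.Close()
}
```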
CL https://golang.org/cl/38005 mentions this issue.
Re-opening and moving to unplanned, to investigate whether the test can be made robust while still marked parallel. See the discussion in CL 38005.
Is it worth making a new issue (or repurposing this one fully) for this linux/amd64-clang failure? https://build.golang.org/log/dde822640469f46493857c984fa4a2dca551817f I took a quick glance at the test and it looks like this error means that the same message gets sent on both channels. It seems strange to me that resource starvation would be the root cause of such an issue, so maybe it is unrelated to the timeout problem?
We can reuse this issue.
Here are some recent failures, according to greplogs/fetchlogs:
In the hopes of bisecting, I just tried reproducing in a loop on a linux/amd64 gomote with tip and
and got no failures. (I tried a similar thing locally on darwin/amd64 and also got no failures.) The linux-amd64-clang gomote won't boot at all. I'm giving up for now, just leaving breadcrumbs for others. If anyone wants to stress-test runs of this test or all net/http tests on some Linux hardware and see if you can reliably reproduce on demand, that'd be a first step.
Did you wait a full 40-60 seconds? It's a VM still, IIRC, and not a dedicated machine or container. (It could be a container, but I don't think it's been migrated.)
gomote waited 180s and timed out.
Maybe there was a burst of commits and we were maxed out on GCE quota. Prioritizing some types of buildlet users over others is #19178.
I'm sporadically seeing the
@siebenmann, yeah, that one's been on the dashboard recently too. I need to understand that test more. It was added in b58515b by @glasser. David, could you also take a look and figure out why it's flaky?
Hmm. I agree that if a machine is overloaded then the timeout could be too small, but people are seeing the wrong RemoteAddr being returned (the final check of the test) without any other errors? Did this start recently? The test in question is 1.5 years old. @siebenmann, are you just seeing this in the context of bigger test runs or are you able to reproduce it on its own?
I've only seen this when building Go through (If the go build status is good in general, I've tended to assume that any test or build failures that I run into are specific to my machines, especially if they're sporadic.)
Could there be a recently introduced bug in DisableKeepAlives? The main way I could imagine that particular failure is if both client.Get calls used the same connection. If that's the case one could de-flake by using separate Transports and Clients or perhaps by setting Connection: close on one side or the other, but there's probably an actual bug to fix...
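A rough illustration of those two de-flaking options (hypothetical names, not the test's actual code): give each request its own Client/Transport with keep-alives disabled, or mark a request with Connection: close so the connection can't be reused.

```go
package sketch_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestNoConnectionSharing(t *testing.T) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer ts.Close()

	// (a) Separate Clients, each with keep-alives disabled, so the two Gets
	// cannot end up on the same pooled connection.
	c1 := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
	c2 := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
	for _, c := range []*http.Client{c1, c2} {
		res, err := c.Get(ts.URL)
		if err != nil {
			t.Fatal(err)
		}
		res.Body.Close()
	}

	// (b) Or keep one client and set Connection: close on a request, so the
	// server tears the connection down after responding.
	req, err := http.NewRequest("GET", ts.URL, nil)
	if err != nil {
		t.Fatal(err)
	}
	req.Close = true // sends "Connection: close"
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		t.Fatal(err)
	}
	res.Body.Close()
}
```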
Although hmm, that theory doesn't make much sense because we do verify that Accept happens twice.
Oh. It looks like 3b988eb lost the use of a Transport with DisableKeepAlives? I'm on my phone now but it doesn't look at a quick glance like ts.Client uses a non-default Transport.
And if I'm correct, looks like a whole bunch of other tests lost DisableKeepAlives in that commit. I suppose some of them may have merely had the flag set to avoid overlapping with other tests on the same Transport but who knows how many other tests actually relied on it internally. I probably should have put a more explicit comment on that line.
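If that diagnosis is right, the minimal repair for a test that switched to ts.Client() is to set the flag back explicitly. This is a sketch of the approach, not the actual CL: the client handed out by httptest.Server.Client() uses an *http.Transport under the hood, so the field is reachable with a type assertion.

```go
package sketch_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestWithKeepAlivesRestored(t *testing.T) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer ts.Close()

	// Put back the behavior the ts.Client() conversion dropped.
	c := ts.Client()
	c.Transport.(*http.Transport).DisableKeepAlives = true

	res, err := c.Get(ts.URL)
	if err != nil {
		t.Fatal(err)
	}
	res.Body.Close()
}
```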
Note that the commit in question merged after this issue was opened but before all the comments on it.
I'm still not clear on how the second Accept succeeded in this case. And the connection surely isn't "idle". But losing the flag is definitely wrong. Will stop updating a zillion times from my phone now :)
I actually don't have a great understanding of when http.Transport will send two requests in parallel on the same connection. (And is http/2 perhaps involved? Some of the build logs above have errors that mention http/2.)
I just had a build failure that also had an http2 test failure:
As usual, redoing the build had everything go fine. |
@siebenmann, that's log spam from an unrelated test, I believe.
FWIW, this is happening on darwin/arm64 as well: https://build.golang.org/log/d6c544a1382073835ea733e4eaf094c243d915a3
CL https://golang.org/cl/38373 mentions this issue.
I'd love to dig in more to see if I can reproduce, but it's been a while since I've dabbled in Go core. Can somebody remind me how to run a specific test without having to run all of the big slow
As noted in #19161 (comment), CL 37771 (adding use of the new httptest.Server.Client to all net/http tests) accidentally reverted DisableKeepAlives for this test. For many tests, DisableKeepAlives was just present to prevent goroutines from staying active after the test exited. In this case it might actually be important. (We'll see)

Updates #19161

Change-Id: I11f889f86c932b51b11846560b68dbe5993cdfc3
Reviewed-on: https://go-review.googlesource.com/38373
Reviewed-by: Michael Munday <[email protected]>
Or if you want to run it in race mode (often shakes things out):
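The exact command in the original reply didn't survive here; something along these lines works from a checked-out tree (the paths, -count value, and -v flag are my guesses, assuming the freshly built go binary is the one on PATH):

```
cd $GOROOT/src/net/http
go test -run=TestServerAllowsBlockingRemoteAddr -count=50 -v

# and with the race detector, as suggested above:
go test -race -run=TestServerAllowsBlockingRemoteAddr -count=50 -v
```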
I can get a test version of Go at commit d80166e to repeatedly fail |
@siebenmann, excellent. Thanks for confirming. I wasn't able to cause the failure so I was just hoping. Glad to hear it's fixed now. Will close for now. Hopefully nobody else hits this again.
@siebenmann I'm glad that seems to work! If you're interested in helping me figure out exactly why it was broken, do you think you could run glasser@223b373 and share what it prints on failure cases? The commit adds a bunch of t.Logs using httptrace.
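I don't have that instrumented diff in front of me, but the general shape of such logging is an httptrace.ClientTrace hung off the request context, roughly like the sketch below (names are mine, not the commit's). GotConnInfo.Reused is the field that shows up as Reused:true in the output quoted further down.

```go
package sketch_test

import (
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"testing"
)

func TestLogConnReuse(t *testing.T) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer ts.Close()

	req, err := http.NewRequest("GET", ts.URL, nil)
	if err != nil {
		t.Fatal(err)
	}
	trace := &httptrace.ClientTrace{
		ConnectStart: func(network, addr string) { t.Logf("dialing %s %s", network, addr) },
		GotConn: func(info httptrace.GotConnInfo) {
			// info.Reused reports whether the connection came from the idle
			// pool instead of a fresh dial.
			t.Logf("got conn: %+v", info)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	res, err := ts.Client().Do(req)
	if err != nil {
		t.Fatal(err)
	}
	res.Body.Close()
}
```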
I cloned your Go repo and checked out that commit, and I get:
(And more copies of them.) Running with just
OK, so it clearly shows that the client is reusing the connection (Reused:true), which can certainly cause the observed behavior. But the strange bit is that Is it possible that something else on the system is making a TCP connection to the server port at just the wrong time? I'm imagining something like
This would all be believable if there were any reason to believe that somebody would make a TCP connection to the server port right around the time it started listening. Possibly it's even just the first request connecting twice for some error-retry reason?
While trybotting https://go-review.googlesource.com/#/c/37149/
https://storage.googleapis.com/go-build-log/aa3dd3e6/linux-amd64-race_252fe50a.log