x/build/env/windows-amd64: optimize Windows start-up time #21153
CL https://golang.org/cl/51030 mentions this issue. |
@johnsonj, you can patch in https://go-review.googlesource.com/51030 |
Seems like network is often the slow and/or flaky thing, even for the 10.* addresses. I've seen (based on serial) that the buildlet can be downloaded and listening even before the 10.* address is pingable. If inbound networking takes a long time to come up, perhaps we need to move to a model where the new VMs connect out only and register themselves with the coordinator to have the coordinator reverse the TCP connection like our other non-VM builders. |
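To make the "connect out only" idea concrete, here is a minimal sketch of a reversed connection: the worker dials the coordinator, and the coordinator then sends requests back over that same socket, so no inbound connectivity to the VM is required. The function names (`workerDialOut`, `coordinatorAsk`, `roundTrip`) and the line-based "ack" protocol are made up for illustration and are not the coordinator's real protocol.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"strings"
)

// workerDialOut plays the VM: it makes a single outbound connection to the
// coordinator and then answers requests that arrive on that connection.
// (Hypothetical protocol: one request per line, reply is "ack: <request>".)
func workerDialOut(coordAddr string) error {
	conn, err := net.Dial("tcp", coordAddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		if _, err := fmt.Fprintf(conn, "ack: %s\n", sc.Text()); err != nil {
			return err
		}
	}
	return sc.Err()
}

// coordinatorAsk plays the coordinator: it waits for a worker to register by
// connecting in, then uses that connection in the reverse direction to send
// one request and read one reply.
func coordinatorAsk(ln net.Listener, req string) (string, error) {
	conn, err := ln.Accept()
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprintf(conn, "%s\n", req); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(reply), nil
}

// roundTrip wires both sides together on a loopback port for demonstration.
func roundTrip(req string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go workerDialOut(ln.Addr().String()) // error ignored in this demo
	return coordinatorAsk(ln, req)
}

func main() {
	reply, err := roundTrip("status")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply)
}
```

The only listener lives on the coordinator side, so the worker depends on outbound networking alone, which the logs in this thread suggest comes up earlier than the inbound 10.* path.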
Here's a run where I see nothing from the buildlet.exe process on serial: But the network changes behavior. Is the network coming up at 07:02:22 and that's why conns are refused? Maybe stage0 already gave up waiting on the network by then?
|
Okay, I caught the weird case again: Here the buildlet comes up and is listening at 07:06:19, but the TCP is dead & 10.240.0.22 doesn't ping until 07:07:08. We have outbound Internet 1 minute before inbound 10.* network!?
But later:
|
Here it's only 10 seconds from outbound network to inbound network:
|
In the delayed run it looks like the network isn't coming up for a while. GCE doesn't provide two NICs for the different IPs. The machine gets an internal IP and the external IP is an address that forwards to the machine. So it seems very strange that it could download the buildlet but not be pinged. Could we add a write to serial port for stage0 or is that too heavy for stage0? |
I guess there are networking fast paths from VMs to Google services like GCS, in that the routes are already known or something?
Great idea. Before you rebuild the images, let me add that and also clean up the network-waiting code and make sure it's enabled for Windows. |
SGTM! |
CL https://golang.org/cl/51130 mentions this issue. |
Updates golang/go#21153 Change-Id: I59e77c191b817e3e6766977931324af13c10deb0 Reviewed-on: https://go-review.googlesource.com/51130 Reviewed-by: Jeff Johnson <[email protected]> Reviewed-by: Brad Fitzpatrick <[email protected]>
@bradfitz I'm getting a panic on launching the new stage0:
|
Well that's disturbing. Will look after my late lunch here. |
Does this mean anything to you, @ianlancetaylor or @aclements? This was a Windows binary I compiled from Linux at master. |
Updates golang/go#21153 Change-Id: I1d80424a3272e7ee21eb176b020273d5903e444b Reviewed-on: https://go-review.googlesource.com/51030 Reviewed-by: Sarah Adams <[email protected]>
@alexbrainman @ianlancetaylor the line we're crashing on is: if stdcall5(_WSAGetOverlappedResult, op.pd.fd, uintptr(unsafe.Pointer(op)), uintptr(unsafe.Pointer(&qty)), 0, uintptr(unsafe.Pointer(&flags))) == 0 { The only interesting and unique thing about this Go program on Windows is that it's running before/while the network is coming up, which is something we've never tested before. Did something change here in Go 1.9? |
CL https://golang.org/cl/51171 mentions this issue. |
While debugging a potential Go 1.9 windows networking crash, revert the use of Go 1.9's Time.Duration so we can still build for Go 1.8. Updates golang/go#21153 Change-Id: I4845910cd0ef376d4891a1802b0c9bcb8f7c5a0a Reviewed-on: https://go-review.googlesource.com/51171 Reviewed-by: Sarah Adams <[email protected]>
Brad rebuilt the binary in Go1.8(.?) and the crash does not repro |
CL https://golang.org/cl/51190 mentions this issue. |
Jeff thinks the stage0 binary is hogging the COM1 port and preventing the buildlet from using it. Updates golang/go#21153 Change-Id: I73e39eeed90269c179818d06864ab1c35ce9fa79 Reviewed-on: https://go-review.googlesource.com/51190 Reviewed-by: Brad Fitzpatrick <[email protected]>
Okay, latest results with
About 50 seconds from request to usable. Not terrible. But the stage0 binary was running at 30 seconds, so we spent 20 seconds waiting for the network to come up. And the stage0 doesn't log when it finally does get network. Let me add that, with a duration. |
Oh, never mind, the log statement is there above:
So 13 seconds for network. |
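For reference, a rough sketch of the "wait for network, then log how long it took" pattern discussed here. This is not the actual stage0 code: `waitForNetwork`, `flakyProbe`, and `simulate` are hypothetical names, and the probe is a stand-in that fails a fixed number of times instead of touching a real network.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// waitForNetwork calls probe every interval until it succeeds or timeout
// elapses. It returns how long it waited, so the caller can log the duration.
func waitForNetwork(probe func() error, interval, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	deadline := start.Add(timeout)
	for {
		err := probe()
		if err == nil {
			return time.Since(start), nil
		}
		if time.Now().After(deadline) {
			return time.Since(start), fmt.Errorf("network still down after %v: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// flakyProbe returns a fake probe that fails the first n calls. A real probe
// might attempt a TCP dial or an HTTP request instead (an assumption).
func flakyProbe(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("network unreachable")
		}
		return nil
	}
}

// simulate runs the wait loop against a fake probe with short demo timings.
func simulate(failures int) (time.Duration, error) {
	return waitForNetwork(flakyProbe(failures), 5*time.Millisecond, 200*time.Millisecond)
}

func main() {
	waited, err := simulate(3)
	if err != nil {
		log.Fatalf("giving up: %v", err)
	}
	log.Printf("network is up after %v", waited)
}
```

Returning the elapsed time from the wait loop keeps the logging decision with the caller, which makes it easy to compare runs such as the 13-second and 36-second waits reported in this thread.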
And 36 seconds waiting for network on my second run:
From
to
36 seconds seems like a damn long time. Is this DHCP not coming up in the VM until late in the boot process, or something on the GCE side? |
It's likely DHCP. We're auto-logging into the VM as soon as it will let us. Windows will let you log in before every service is running. Maybe some services could be disabled to reduce this time, but the critical metric is VM start request -> buildlet response, which looks like 1min15sec. Not too bad. |
CL https://golang.org/cl/51192 mentions this issue. |
Good idea. We don't need a file server, etc.
I dunno, I'd prefer more like 30-40 seconds. We're already at a few seconds for Linux (GKE), close to 0 for dedicated machines, and around 30-40 seconds for the BSDs. Windows is the slow one here. We can't fix this by sharding tests wider (without a lot more CPU) because this initial VM boot slows down the serial part of our build+test (the build). |
runtime/netpoll_windows.go has not changed meaningfully since 1.8. The layer above has moved from the net package to the internal/poll package, but any changes during that move were inadvertent. I don't fully understand how the Windows code works. As far as I can see it always sets up |
…o child Updates golang/go#21153 Updates golang/go#17167 Change-Id: Ibe575d295468235a16c01f3420a9597373ab3891 Reviewed-on: https://go-review.googlesource.com/51192 Reviewed-by: Brad Fitzpatrick <[email protected]>
@ianlancetaylor, thanks. I've pulled that discussion out to #21172. |
This is just GCE delays configuring the network. VMs start booting in seconds & running user code & serial works, but network doesn't work for like 30+ seconds. Known issue. Nothing we can do but wait for fixes. |
If we can't fix the startup time, can we keep a pool of ready-to-build Windows machines running? |
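One way the warm-pool idea could look, as a minimal sketch only: a fixed number of VMs is kept booted (or booting) in the background, and each one handed out triggers a replacement. `Pool`, `NewPool`, `Get`, and `fakeCreator` are hypothetical names, and VM creation is faked with a counter instead of a GCE API call.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Pool keeps up to size VMs either ready or in the middle of booting.
type Pool struct {
	ready  chan string   // names of VMs that have finished booting
	create func() string // slow operation: boot a VM and return its name
}

// NewPool starts booting size VMs in the background.
func NewPool(size int, create func() string) *Pool {
	p := &Pool{ready: make(chan string, size), create: create}
	for i := 0; i < size; i++ {
		go p.refill()
	}
	return p
}

// refill boots one VM and parks it in the pool. The send never blocks because
// ready plus in-flight VMs never exceeds the channel capacity.
func (p *Pool) refill() { p.ready <- p.create() }

// Get hands out a booted VM (waiting only if none is ready yet) and
// immediately starts booting a replacement.
func (p *Pool) Get() string {
	vm := <-p.ready
	go p.refill()
	return vm
}

// fakeCreator stands in for real VM creation; it only generates unique names.
func fakeCreator() func() string {
	var n int64
	return func() string {
		return fmt.Sprintf("win-vm-%d", atomic.AddInt64(&n, 1))
	}
}

func main() {
	pool := NewPool(2, fakeCreator())
	for i := 0; i < 3; i++ {
		fmt.Println("got", pool.Get())
	}
}
```

The trade-off is cost: idle VMs are paid for whether or not a build arrives, so the pool size would have to be tuned against how often Windows builds are requested.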
I've been looking into Windows buildlet VM start-up latency and reliability.
I wrote a little tool to create an instance and log what's happening.
Will post details here.
/cc @johnsonj