Skip to content

x/build/env/windows-amd64: optimize Windows start-up time #21153

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bradfitz opened this issue Jul 25, 2017 · 28 comments
Closed

x/build/env/windows-amd64: optimize Windows start-up time #21153

bradfitz opened this issue Jul 25, 2017 · 28 comments
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge
Milestone

Comments

@bradfitz
Copy link
Contributor

I've been looking into Windows buildlet VM start-up latency and reliability.

I wrote a little tool to create an instance and log what's happening.

Will post details here.

/cc @johnsonj

@gopherbot gopherbot added this to the Unreleased milestone Jul 25, 2017
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Jul 25, 2017
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/51030 mentions this issue.

@bradfitz
Copy link
Contributor Author

@johnsonj, you can patch in https://go-review.googlesource.com/51030

@bradfitz
Copy link
Contributor Author

Seems like network is often the slow and/or flaky thing, even for the 10.* addresses.

I've seen (based on serial) that the buildlet can be downloaded and listening even before the 10.* address is pingable.

If inbound networking takes a long time to come up, perhaps we need to move to a model where the new VMs connect out only and register themselves with the coordinator to have the coordinator reverse the TCP connection like our other non-VM builders.

@bradfitz
Copy link
Contributor Author

Here's a run where I see nothing from the buildlet.exe process on serial:

But the network changes behavior. Is the network coming up at 07:02:22 and that's why conns are refused? Maybe stage0 already gave up waiting on the network by then?

$ go install ./debugnewvm/ && debugnewvm -serial
2017/07/25 07:00:53 Creating debug-brad-windows-1500966053
2017/07/25 07:00:54 instance requested
2017/07/25 07:01:22 instance created
2017/07/25 07:01:22 got instance info
2017/07/25 07:01:22 Trying http://10.240.0.14 ...
2017/07/25 07:01:22 SERIAL: SeaBIOS (version 1.8.2-20170419_170401-google)
                            Total RAM Size = 0x00000000e6600000 = 3686 MiB
                            CPUs found: 4     Max CPUs supported: 4
                            found virtio-scsi at 0:3
                            virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
                            virtio-scsi blksize=512 sectors=104857600 = 51200 MiB
                            drive 0x000f31a0: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=104857600
                            Booting from Hard Disk 0...
2017/07/25 07:01:27 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:01:28 Trying http://10.240.0.14 ...
2017/07/25 07:01:28 SERIAL: 7/25/2017 7:01:27 AM UTC: GCE Agent started (version 3.5.1.0).
2017/07/25 07:01:30 SERIAL: 7/25/2017 7:01:29 AM UTC: Starting startup scripts (version 3.5.1.0).
2017/07/25 07:01:33 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:01:34 Trying http://10.240.0.14 ...
2017/07/25 07:01:36 SERIAL: 7/25/2017 7:01:36 AM UTC: Finished running startup scripts.
2017/07/25 07:01:39 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:01:40 Trying http://10.240.0.14 ...
2017/07/25 07:01:45 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:01:46 Trying http://10.240.0.14 ...
2017/07/25 07:01:51 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:01:52 Trying http://10.240.0.14 ...
2017/07/25 07:01:57 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:01:58 Trying http://10.240.0.14 ...
2017/07/25 07:02:03 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:02:04 Trying http://10.240.0.14 ...
2017/07/25 07:02:09 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:02:10 Trying http://10.240.0.14 ...
2017/07/25 07:02:15 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:02:16 Trying http://10.240.0.14 ...
2017/07/25 07:02:21 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:02:22 Trying http://10.240.0.14 ...
2017/07/25 07:02:22 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:02:23 Trying http://10.240.0.14 ...
2017/07/25 07:02:23 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:02:24 Trying http://10.240.0.14 ...
2017/07/25 07:02:24 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:02:25 Trying http://10.240.0.14 ...
2017/07/25 07:02:25 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:02:26 Trying http://10.240.0.14 ...
2017/07/25 07:02:26 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:02:27 Trying http://10.240.0.14 ...
2017/07/25 07:02:27 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:02:28 Trying http://10.240.0.14 ...
...
2017/07/25 07:04:51 Trying http://10.240.0.14 ...
2017/07/25 07:04:51 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:04:52 Trying http://10.240.0.14 ...
2017/07/25 07:04:52 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:04:53 Trying http://10.240.0.14 ...
2017/07/25 07:04:53 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:04:54 Trying http://10.240.0.14 ...
2017/07/25 07:04:54 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused
2017/07/25 07:04:55 Trying http://10.240.0.14 ...
2017/07/25 07:04:55 Trying http://10.240.0.14: <nil>, Get http://10.240.0.14: dial tcp 10.240.0.14:80: getsockopt: connection refused

@bradfitz
Copy link
Contributor Author

bradfitz commented Jul 25, 2017

Okay, I caught the weird case again:

Here the buildlet comes up and is listening at 07:06:19, but the TCP is dead & 10.240.0.22 doesn't ping until 07:07:08. We have outbound Internet 1 minute before inbound 10.* network!?

$ go install ./debugnewvm/ && debugnewvm -serial
2017/07/25 07:05:38 Creating debug-brad-windows-1500966338
2017/07/25 07:05:40 instance requested
2017/07/25 07:05:46 instance created
2017/07/25 07:05:46 got instance info
2017/07/25 07:05:46 Trying http://10.240.0.22 ...
2017/07/25 07:05:46 SERIAL: SeaBIOS (version 1.8.2-20170419_170401-google)
                            Total RAM Size = 0x00000000e6600000 = 3686 MiB
                            CPUs found: 4     Max CPUs supported: 4
                            found virtio-scsi at 0:3
                            virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
                            virtio-scsi blksize=512 sectors=104857600 = 51200 MiB
                            drive 0x000f31a0: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=104857600
                            Booting from Hard Disk 0...
2017/07/25 07:05:51 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:05:52 Trying http://10.240.0.22 ...
2017/07/25 07:05:54 SERIAL: 7/25/2017 7:05:53 AM UTC: GCE Agent started (version 3.5.1.0).
2017/07/25 07:05:55 SERIAL: 7/25/2017 7:05:55 AM UTC: Starting startup scripts (version 3.5.1.0).
2017/07/25 07:05:57 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:05:58 Trying http://10.240.0.22 ...
2017/07/25 07:06:01 SERIAL: 7/25/2017 7:06:01 AM UTC: Finished running startup scripts.
2017/07/25 07:06:03 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:04 Trying http://10.240.0.22 ...
2017/07/25 07:06:09 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:10 Trying http://10.240.0.22 ...
2017/07/25 07:06:15 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:16 Trying http://10.240.0.22 ...
2017/07/25 07:06:19 SERIAL: 2017/07/25 07:06:19 buildlet starting.
2017/07/25 07:06:19 Will listen on :80
2017/07/25 07:06:19 Listening on :80 ...
2017/07/25 07:06:21 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:22 Trying http://10.240.0.22 ...
2017/07/25 07:06:27 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:28 Trying http://10.240.0.22 ...
2017/07/25 07:06:33 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:34 Trying http://10.240.0.22 ...
2017/07/25 07:06:39 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:40 Trying http://10.240.0.22 ...
2017/07/25 07:06:45 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:46 Trying http://10.240.0.22 ...
2017/07/25 07:06:51 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:06:52 Trying http://10.240.0.22 ...
2017/07/25 07:06:57 Trying http://10.240.0.22: <nil>, Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
^Z
[1]+  Stopped                 debugnewvm -serial
bradfitz@gdev:~/src/golang.org/x/build/cmd$ ping 10.240.0.22
PING 10.240.0.22 (10.240.0.22) 56(84) bytes of data.
(nothing)

But later:

2017/07/25 07:07:08 Trying http://10.240.0.22 ...
2017/07/25 07:07:08 Trying http://10.240.0.22: &{200 OK 200 HTTP/1.1 1 1 map[Date:[Tue, 25 Jul 2017 07:07:08 GMT] Content-Length:[34] Content-Type:[text/plain; charset=utf-8]] 0xc4203a04c0 34 [] true false map[] 0xc420275400 <nil>}, <nil>
2017/07/25 07:07:08 WorkDir: C:\workdir, <nil>
2017/07/25 07:07:08 done.
2017/07/25 07:07:09 SERIAL: 2017/07/25 07:07:08 Halting in 1 second.
2017/07/25 07:07:10 SERIAL: 2017/07/25 07:07:09 Halting machine.
2017/07/25 07:07:09 Shutdown: No system-specific halt command run; will just end buildlet process.
2017/07/25 07:07:09 Ending buildlet process post-halt

@bradfitz
Copy link
Contributor Author

Here it's only 10 seconds from outbound network to inbound network:

2017/07/25 07:08:43 Creating debug-brad-windows-1500966523
2017/07/25 07:08:44 instance requested
2017/07/25 07:08:57 instance created
2017/07/25 07:08:57 got instance info
2017/07/25 07:08:57 Trying http://10.240.0.23 ...
2017/07/25 07:08:57 SERIAL: SeaBIOS (version 1.8.2-20170419_170401-google)
                            Total RAM Size = 0x00000000e6600000 = 3686 MiB
                            CPUs found: 4     Max CPUs supported: 4
                            found virtio-scsi at 0:3
                            virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
                            virtio-scsi blksize=512 sectors=104857600 = 51200 MiB
                            drive 0x000f31a0: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=104857600
                            Booting from Hard Disk 0...
2017/07/25 07:09:02 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:02 SERIAL: 7/25/2017 7:09:02 AM UTC: GCE Agent started (version 3.5.1.0).
2017/07/25 07:09:03 Trying http://10.240.0.23 ...
2017/07/25 07:09:04 SERIAL: 7/25/2017 7:09:04 AM UTC: Starting startup scripts (version 3.5.1.0).
2017/07/25 07:09:08 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:09 Trying http://10.240.0.23 ...
2017/07/25 07:09:11 SERIAL: 7/25/2017 7:09:10 AM UTC: Finished running startup scripts.
2017/07/25 07:09:14 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:15 Trying http://10.240.0.23 ...
2017/07/25 07:09:20 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:21 Trying http://10.240.0.23 ...
2017/07/25 07:09:26 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:27 Trying http://10.240.0.23 ...
2017/07/25 07:09:32 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:33 Trying http://10.240.0.23 ...
2017/07/25 07:09:38 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:39 Trying http://10.240.0.23 ...
2017/07/25 07:09:42 SERIAL: 2017/07/25 07:09:42 buildlet starting.
2017/07/25 07:09:42 Will listen on :80
2017/07/25 07:09:42 Listening on :80 ...
2017/07/25 07:09:44 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:45 Trying http://10.240.0.23 ...
2017/07/25 07:09:50 Trying http://10.240.0.23: <nil>, Get http://10.240.0.23: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 07:09:51 Trying http://10.240.0.23 ...
2017/07/25 07:09:54 Trying http://10.240.0.23: &{200 OK 200 HTTP/1.1 1 1 map[Date:[Tue, 25 Jul 2017 07:09:54 GMT] Content-Length:[34] Content-Type:[text/plain; charset=utf-8]] 0xc420383060 34 [] true false map[] 0xc420467800 <nil>}, <nil>
2017/07/25 07:09:54 WorkDir: C:\workdir, <nil>
2017/07/25 07:09:54 done.
2017/07/25 07:09:55 SERIAL: 2017/07/25 07:09:54 Halting in 1 second.
2017/07/25 07:09:56 SERIAL: 2017/07/25 07:09:55 Halting machine.
2017/07/25 07:09:55 Shutdown: No system-specific halt command run; will just end buildlet process.
2017/07/25 07:09:55 Ending buildlet process post-halt

@johnsonj
Copy link
Contributor

In the delayed run it looks like the network isn't coming up for a while. GCE doesn't provide two NICs for the different IPs. The machine gets an internal IP and the external IP is an address that forwards to the machine. So it seems very strange that it could download the buildlet but not be pinged.

Could we add a write to serial port for stage0 or is that too heavy for stage0?

@bradfitz
Copy link
Contributor Author

So it seems very strange that it could download the buildlet but not be pinged.

I guess there are networking fast paths from VMs to Google services like GCS, in that the routes are already known or something?

Could we add a write to serial port for stage0 or is that too heavy for stage0?

Great idea.

Before you rebuild the images, let me add that and also clean up the network-waiting code and make sure it's enabled for Windows.

@johnsonj
Copy link
Contributor

SGTM!

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/51130 mentions this issue.

gopherbot pushed a commit to golang/build that referenced this issue Jul 25, 2017
Updates golang/go#21153

Change-Id: I59e77c191b817e3e6766977931324af13c10deb0
Reviewed-on: https://go-review.googlesource.com/51130
Reviewed-by: Jeff Johnson <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
@johnsonj
Copy link
Contributor

@bradfitz I'm getting a panic on launching the new stage0:

fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x427c12]

runtime stack:
runtime.throw(0x695d84, 0x2a)
        /home/bradfitz/go/src/runtime/panic.go:605 +0x9c
runtime.sigpanic()
        /home/bradfitz/go/src/runtime/signal_windows.go:155 +0x184
runtime.netpoll(0xc042016000, 0x0)
        /home/bradfitz/go/src/runtime/netpoll_windows.go:105 +0x332
runtime.findrunnable(0xc042016000, 0x0)
        /home/bradfitz/go/src/runtime/proc.go:1966 +0x90f
runtime.schedule()
        /home/bradfitz/go/src/runtime/proc.go:2245 +0x13a
runtime.mstart1()
        /home/bradfitz/go/src/runtime/proc.go:1182 +0xcd
runtime.mstart()
        /home/bradfitz/go/src/runtime/proc.go:1142 +0x6b

goroutine 1 [runnable]:
net/http.(*Transport).RoundTrip(0xc0420f0000, 0xc0420f4100, 0xc0420f0000, 0xbe56
069a33e72c04, 0x13197476c)
        /home/bradfitz/go/src/net/http/transport.go:341 +0xf86
net/http.send(0xc0420f4000, 0x7aaac0, 0xc0420f0000, 0xbe56069a33e72c04, 0x131974
76c, 0x7e0260, 0xc04206a048, 0xbe56069a33e72c04, 0xc042067ac8, 0x1)
        /home/bradfitz/go/src/net/http/client.go:249 +0x1b0
net/http.(*Client).send(0xc0420748d0, 0xc0420f4000, 0xbe56069a33e72c04, 0x131974
76c, 0x7e0260, 0xc04206a048, 0x0, 0x1, 0xc042067b50)
        /home/bradfitz/go/src/net/http/client.go:173 +0x104
net/http.(*Client).Do(0xc0420748d0, 0xc0420f4000, 0x693570, 0x21, 0x0)
        /home/bradfitz/go/src/net/http/client.go:602 +0x294
net/http.(*Client).Get(0xc0420748d0, 0x693570, 0x21, 0xc04204a3e0, 0x48a76d, 0x5
977a2e3)
        /home/bradfitz/go/src/net/http/client.go:393 +0xae
main.isNetworkUp(0xbe560698f3e72c04)
        /home/bradfitz/src/golang.org/x/build/cmd/buildlet/stage0/stage0.go:204
+0xab
main.awaitNetwork(0x690700)
        /home/bradfitz/src/golang.org/x/build/cmd/buildlet/stage0/stage0.go:181
+0x16a
main.main()
        /home/bradfitz/src/golang.org/x/build/cmd/buildlet/stage0/stage0.go:82 +
0xd2

goroutine 19 [runnable]:
net/http.setRequestCancel.func3(0x0, 0xc0420749c0, 0xc0420482c0, 0xc04204c42c, 0
xc04203e240)
        /home/bradfitz/go/src/net/http/client.go:319
created by net/http.setRequestCancel
        /home/bradfitz/go/src/net/http/client.go:319 +0x2c6

C:\golang>

@bradfitz
Copy link
Contributor Author

Well that's disturbing. Will look after my late lunch here.

@bradfitz
Copy link
Contributor Author

Does this mean anything to you, @ianlancetaylor or @aclements?

This was a Windows binary I compiled from Linux at master.

gopherbot pushed a commit to golang/build that referenced this issue Jul 25, 2017
Updates golang/go#21153

Change-Id: I1d80424a3272e7ee21eb176b020273d5903e444b
Reviewed-on: https://go-review.googlesource.com/51030
Reviewed-by: Sarah Adams <[email protected]>
@bradfitz
Copy link
Contributor Author

@alexbrainman @ianlancetaylor the line we're crashing on is src/runtime/netpoll_windows.go:105 at:

                        if stdcall5(_WSAGetOverlappedResult, op.pd.fd, uintptr(unsafe.Pointer(op)), uintptr(unsafe.Pointer(&qty)), 0, uintptr(unsafe.Pointer(&flags))) == 0 {

And 0xc0000005 is EXCEPTION_ACCESS_VIOLATION I believe.

Is op or op.fd nil?

The only interesting and unique thing about this Go program on Windows is that it's running before/while the network is coming up, which is something we've never tested before.

Did something change here in Go 1.9?

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/51171 mentions this issue.

gopherbot pushed a commit to golang/build that referenced this issue Jul 25, 2017
While debugging a potential Go 1.9 windows networking crash, revert
the use of Go 1.9's Time.Duration so we can still build for Go 1.8.

Updates golang/go#21153

Change-Id: I4845910cd0ef376d4891a1802b0c9bcb8f7c5a0a
Reviewed-on: https://go-review.googlesource.com/51171
Reviewed-by: Sarah Adams <[email protected]>
@johnsonj
Copy link
Contributor

Brad rebuilt the binary in Go1.8(.?) and the crash does not repro

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/51190 mentions this issue.

gopherbot pushed a commit to golang/build that referenced this issue Jul 25, 2017
Jeff thinks the stage0 binary is hogging the COM1 port and preventing
the buildlet from using it.

Updates golang/go#21153

Change-Id: I73e39eeed90269c179818d06864ab1c35ce9fa79
Reviewed-on: https://go-review.googlesource.com/51190
Reviewed-by: Brad Fitzpatrick <[email protected]>
@bradfitz
Copy link
Contributor Author

Okay, latest results with windows-amd64-server-2012r2-testing:

$ debugnewvm 
2017/07/25 23:58:58 Creating debug-temp-1501027138
2017/07/25 23:58:59 instance requested
2017/07/25 23:59:08 instance created
2017/07/25 23:59:08 got instance info
2017/07/25 23:59:08 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:08 SERIAL: SeaBIOS (version 1.8.2-20170419_170401-google)
                            Total RAM Size = 0x00000000e6600000 = 3686 MiB
                            CPUs found: 4     Max CPUs supported: 4
                            found virtio-scsi at 0:3
                            virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
                            virtio-scsi blksize=512 sectors=104857600 = 51200 MiB
                            drive 0x000f31a0: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=104857600
                            Booting from Hard Disk 0...
2017/07/25 23:59:13 client buildlet probe error: Get http://10.240.0.15: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 23:59:14 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:15 SERIAL: 2017/07/25 23:59:14 GCEWindowsAgent: GCE Agent Started (version 4.1.1@4)
2017/07/25 23:59:16 SERIAL: 2017/07/25 23:59:16 GCEMetadataScripts: Starting startup scripts (version 4.1.6@1).
2017/07/25 23:59:16 GCEMetadataScripts: No startup scripts to run.
2017/07/25 23:59:19 client buildlet probe error: Get http://10.240.0.15: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 23:59:20 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:25 client buildlet probe error: Get http://10.240.0.15: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 23:59:26 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:31 SERIAL: stage0: 2017/07/25 23:59:30 bootstrap binary running.
stage0: 2017/07/25 23:59:30 waiting for network.
stage0: 2017/07/25 23:59:30 network still down; probe failure took 0s
2017/07/25 23:59:31 client buildlet probe error: Get http://10.240.0.15: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 23:59:32 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:36 SERIAL: stage0: 2017/07/25 23:59:35 network still down; probe failure took 0s
2017/07/25 23:59:37 client buildlet probe error: Get http://10.240.0.15: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 23:59:38 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:41 SERIAL: stage0: 2017/07/25 23:59:40 network still down; probe failure took 0s
2017/07/25 23:59:43 client buildlet probe error: Get http://10.240.0.15: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/25 23:59:44 About to hit http://10.240.0.15 to see if buildlet is up yet...
2017/07/25 23:59:44 buildlet probe: 200 OK
2017/07/25 23:59:44 WorkDir: C:\workdir, <nil>
2017/07/25 23:59:44 done.
2017/07/25 23:59:44 SERIAL: stage0: 2017/07/25 23:59:43 network is up.
stage0: 2017/07/25 23:59:43 downloading http://storage.googleapis.com/go-builder-data/buildlet.windows-amd64 to .\buildlet.exe ...
stage0: 2017/07/25 23:59:43 downloaded .\buildlet.exe (7435776 bytes)
2017/07/25 23:59:44 buildlet starting.
2017/07/25 23:59:44 Will listen on :80
2017/07/25 23:59:44 Listening on :80 ...
2017/07/25 23:59:44 Halting in 1 second.
2017/07/25 23:59:46 SERIAL: 2017/07/25 23:59:45 Halting machine.
2017/07/25 23:59:45 Shutdown: No system-specific halt command run; will just end buildlet process.
2017/07/25 23:59:45 Ending buildlet process post-halt

About 50 seconds from request to usable.

Not terrible.

But the stage0 binary was running at 30 seconds, so we spent 20 seconds waiting for the network to come up.

And the stage0 doesn't log when it finally does get network. Let me add that, with a duration.

@bradfitz
Copy link
Contributor Author

Oh, never mind, the log statement is there above:

2017/07/25 23:59:31 SERIAL: stage0: 2017/07/25 23:59:30 bootstrap binary running.
...
2017/07/25 23:59:44 SERIAL: stage0: 2017/07/25 23:59:43 network is up.

So 13 seconds for network.

@bradfitz
Copy link
Contributor Author

And 36 seconds waiting for network on my second run:

$ debugnewvm 
2017/07/26 00:07:47 Creating debug-temp-1501027667
2017/07/26 00:07:48 instance requested
2017/07/26 00:07:59 instance created
2017/07/26 00:07:59 got instance info
2017/07/26 00:07:59 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:01 SERIAL: SeaBIOS (version 1.8.2-20170419_170401-google)
                            Total RAM Size = 0x00000000e6600000 = 3686 MiB
                            CPUs found: 4     Max CPUs supported: 4
                            found virtio-scsi at 0:3
                            virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
                            virtio-scsi blksize=512 sectors=104857600 = 51200 MiB
                            drive 0x000f31a0: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=104857600
                            Booting from Hard Disk 0...
2017/07/26 00:08:04 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:05 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:06 SERIAL: 2017/07/26 00:08:05 GCEWindowsAgent: GCE Agent Started (version 4.1.1@4)
2017/07/26 00:08:07 SERIAL: 2017/07/26 00:08:07 GCEMetadataScripts: Starting startup scripts (version 4.1.6@1).
2017/07/26 00:08:07 GCEMetadataScripts: No startup scripts to run.
2017/07/26 00:08:10 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:11 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:16 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:17 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:22 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:22 SERIAL: stage0: 2017/07/26 00:08:21 bootstrap binary running.
stage0: 2017/07/26 00:08:21 waiting for network.
stage0: 2017/07/26 00:08:21 network still down; probe failure took 0s
2017/07/26 00:08:23 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:28 SERIAL: stage0: 2017/07/26 00:08:26 network still down; probe failure took 0s
2017/07/26 00:08:28 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:29 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:32 SERIAL: stage0: 2017/07/26 00:08:31 network still down; probe failure took 0s
2017/07/26 00:08:34 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:35 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:37 SERIAL: stage0: 2017/07/26 00:08:36 network still down; probe failure took 0s
2017/07/26 00:08:40 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:41 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:44 SERIAL: stage0: 2017/07/26 00:08:42 network still down; probe failure took 5s
2017/07/26 00:08:46 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:47 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:50 SERIAL: stage0: 2017/07/26 00:08:48 network still down; probe failure took 5s
2017/07/26 00:08:52 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:53 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:55 SERIAL: stage0: 2017/07/26 00:08:54 network still down; probe failure took 5s
2017/07/26 00:08:58 client buildlet probe error: Get http://10.240.0.22: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017/07/26 00:08:59 SERIAL: stage0: 2017/07/26 00:08:57 network is up.
stage0: 2017/07/26 00:08:57 downloading http://storage.googleapis.com/go-builder-data/buildlet.windows-amd64 to .\buildlet.exe ...
stage0: 2017/07/26 00:08:58 downloaded .\buildlet.exe (7435776 bytes)
2017/07/26 00:08:58 buildlet starting.
2017/07/26 00:08:58 Will listen on :80
2017/07/26 00:08:58 Listening on :80 ...
2017/07/26 00:08:59 About to hit http://10.240.0.22 to see if buildlet is up yet...
2017/07/26 00:08:59 buildlet probe: 200 OK
2017/07/26 00:08:59 WorkDir: C:\workdir, <nil>
2017/07/26 00:08:59 done.
2017/07/26 00:09:00 SERIAL: 2017/07/26 00:08:59 Halting in 1 second.
2017/07/26 00:09:01 SERIAL: 2017/07/26 00:09:00 Halting machine.
2017/07/26 00:09:00 Shutdown: No system-specific halt command run; will just end buildlet process.
2017/07/26 00:09:00 Ending buildlet process post-halt

From

2017/07/26 00:08:22 SERIAL: stage0: 2017/07/26 00:08:21 bootstrap binary running.

to

2017/07/26 00:08:59 SERIAL: stage0: 2017/07/26 00:08:57 network is up.

36 seconds seems like a damn long time.

Is this DHCP not coming up in the VM until late in the boot process, or something on the GCE side?

@johnsonj
Copy link
Contributor

It's likely DHCP. We're auto-logging into the VM as soon as it will let us. Windows will let you log in before every service is running.

Maybe some services could be disabled to reduce this time, but the critical metric is vm start request -> buildlet response which looks like 1min15sec. Not too bad.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/51192 mentions this issue.

@bradfitz
Copy link
Contributor Author

Maybe some services could be disabled to reduce this time,

Good idea. We don't need a file server, etc.

but the critical metric is vm start request -> buildlet response which looks like 1min15sec. Not too bad.

I dunno, I'd prefer more like 30-40 seconds. We're already few seconds for Linux (GKE), close to 0 for dedicated machines, and around 30-40 seconds for the BSD. Windows is the slow one here.

We can't fix this by sharding tests wider (without a lot more CPU) because this initial VM boot slows down the serial part of our build+test (the build).

@ianlancetaylor
Copy link
Contributor

runtime/netpoll_windows.go has not changed meaningfully since 1.8.

The layer above has moved from the net package to the internal/poll package, but any changes during that move were inadvertent.

I don't fully understand how the Windows code works. As far as I can see it always sets up op and op.pd before any operation that could trigger the poller. But I wouldn't be shocked if there is some error case that can let op.pd be nil.

gopherbot pushed a commit to golang/build that referenced this issue Jul 26, 2017
…o child

Updates golang/go#21153
Updates golang/go#17167

Change-Id: Ibe575d295468235a16c01f3420a9597373ab3891
Reviewed-on: https://go-review.googlesource.com/51192
Reviewed-by: Brad Fitzpatrick <[email protected]>
@bradfitz
Copy link
Contributor Author

@ianlancetaylor, thanks. I've pulled that discussion out to #21172.

@bradfitz
Copy link
Contributor Author

This is just GCE delays configuring the network. VMs start booting in seconds & running user code & serial works, but network doesn't work for like 30+ seconds. Known issue. Nothing we can do but wait for fixes.

@johnsonj
Copy link
Contributor

If we can't fix the startup time, can we keep a pool of ready to build windows machines running?

@bradfitz
Copy link
Contributor Author

@johnsonj, that is #11107

@golang golang locked and limited conversation to collaborators Dec 15, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge
Projects
None yet
Development

No branches or pull requests

4 participants