Description
The coordinator sometimes restarts unexpectedly, causing build failures and terminating gomote sessions.
Trends:
- I have always observed this under load (many builders building at once, such that we are at our maximum threshold of 500 GCP CPUs).
- When it does restart, it happens in bursts (many restarts over a span of hours).
- There is a correlation between coordinator restarts and Go releases (usage of golang.org/x/build/cmd/release), though that could simply be because the release tool generates load.
- There is no indication in the logs that the coordinator restarted due to a program error: no panics, nor any other sign of an error on the coordinator side. Just start-up logs. E.g.:
2017/10/25 18:59:36 Creating Kubernetes pod "buildlet-linux-kubestd-rn68e80e3" for host-linux-kubestd
2017/10/25 18:59:36 Registering reverse buildlet "go-builder-2" (10.240.0.20:46624) for host type host-linux-arm5spacemonkey
2017/10/25 18:59:36 Buildlet go-builder-2/10.240.0.20:46624: {Version:15} for [host-linux-arm5spacemonkey]
# restart here
2017/10/25 18:59:38 coordinator version "d55be8da83e622767c00a19031b29cb701fcc744" starting
The rest of this issue documents specific instances in which I observed these restarts.
1. 08/11/2017
Maintner went down and many builders queued up (see #21383).
When we got maintner back up, all queued builders tried to run, and many (most?) failed.
It seemed to me that the failures had to be caused by all the builders sharing some resource and hitting some limit.
@aclements helped me debug a bit; we concluded that we were probably hitting a disk write IOPS limit.
I cannot remember whether we observed coordinator restarts at the time.
2. 09/22/2017
The second instance of this that I've seen is @rsc's golang.org/x/build/cmd/release (binary named build.release below) failures:
$ ./build.release boring18
...
2017/09/22 16:45:28 linux-amd64: Start.
2017/09/22 16:45:28 linux-amd64: Creating buildlet.
2017/09/22 16:47:13 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: EOF
$ ./build.release boring18
2017/09/22 16:49:32 linux-amd64: Start.
2017/09/22 16:49:32 linux-amd64: Creating buildlet.
2017/09/22 16:49:47 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: dial tcp 107.178.219.46:443: getsockopt: connection timed out
ls: cannot access go1.8.3b4.linux-amd64.tar.gz: No such file or directory
$ ./build.release boring18
2017/09/22 16:52:48 linux-amd64: Start.
2017/09/22 16:52:48 linux-amd64: Creating buildlet.
2017/09/22 16:55:05 linux-amd64: Pushing source to buildlet.
2017/09/22 16:55:22 linux-amd64: Writing VERSION file.
2017/09/22 16:55:22 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 16:55:22 linux-amd64: Building.
2017/09/22 16:59:24 linux-amd64: Error: error copying response: unexpected EOF
$ ./build.release boring18
2017/09/22 17:02:11 linux-amd64: Start.
2017/09/22 17:02:11 linux-amd64: Creating buildlet.
2017/09/22 17:04:24 linux-amd64: Pushing source to buildlet.
2017/09/22 17:04:40 linux-amd64: Writing VERSION file.
2017/09/22 17:04:40 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 17:04:40 linux-amd64: Building.
2017/09/22 17:11:32 linux-amd64: Error: error copying response: unexpected EOF
The last 3 failures were due to the coordinator restarting in the middle of handling the requests.
There is no indication that the coordinator restarted due to a program error: no panics, nor any other sign of an error on the coordinator side.
Buildlet logs were also interspersed with non-program builder failures (just "exit status 1") at the same time, e.g.:
13:45:31.000
2017/09/22 20:45:31 {windows-amd64-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab } failed: tests failed: dist test failed: go_test:os: exit status 1
13:45:51.000
2017/09/22 20:45:51 [build windows-386-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab]: Execution error running go_test:path/filepath on http://10.240.0.71 GCE VM: buildlet-windows-amd64-2008-rnbeb0f3b: buildlet: Client closed (numFails = 1)
13:46:22.000
2017/09/22 20:46:22 failed to extract snapshot for helper 10.0.3.52:80: Post http://10.0.3.52/writetgz?dir=go: EOF
13:46:33.000
2017/09/22 20:46:33 {linux-amd64-alpine 93c869ec1678e2cbe3baf4c2e48f7c3482012ae2 } failed: tests failed: dist test failed: go_test:runtime: exit status 1
13:46:37.000
2017/09/22 20:46:37 [build linux-amd64-alpine 39d24b6b6d3e25b4e3a5ad6c7b79891fc3020319]: Execution error running runtime:cpu124 on http://10.0.5.111 Kube Pod: buildlet-linux-x86-alpine-rn34e9c04: buildlet: Client closed (numFails = 1)
13:46:47.000
2017/09/22 20:46:47 {plan9-amd64-9front 42cb4dcdb59a0be3ad52bf6cf33489301d039e08 } failed: tests failed: dist test failed: go_test:net: exit status: 'go 29782: 1'
@rsc tried the release again a few days later and everything worked fine; there were no changes to the builder pipeline (that I am aware of) in between.
3. 10/25/2017
Restart timestamps (Eastern): 15:06, 14:59, 13:56
shadams$ kubectl get po
NAME READY STATUS RESTARTS AGE
coordinator-deployment-1785484075-5gcw0 1/1 Running 3 1h
Activity
broady commented on Sep 26, 2017
More long-term, we should come up with some plan to prevent these failures (backpressure) and also provide some priority queue (e.g., give priority to release builders).
For cmd/release, I did have a (mental) todo to add support for retries, but it happens infrequently enough that I just retry manually.
bradfitz commented on Sep 26, 2017
That is #19178.
[Title changed: "x/build: failures under load" → "x/build: coordinator restarts under load" → "x/build: coordinator restarts"]
s-mang commented on Nov 2, 2017
Thanks to @kelseyhightower for helping me debug this.
It seems the coordinator has been restarting because it was running out of memory.
The coordinator had a resource limit of 2Gi of memory in its deployment config.
We removed the limit in place (via kubectl edit deployment coordinator-deployment).
I will submit a CL to remove the limit in the deployment config source (once I find it).
bradfitz commented on Nov 2, 2017
Sarah, can you or Kelsey note here how this was debugged, for future reference?
The limit is at https://github.com/golang/build/blob/master/cmd/coordinator/deployment-prod.yaml#L26
s-mang commented on Nov 2, 2017
Yes, totally. Good idea.
Under the status section, the reason for the last restart was shown as OOM.
I will submit the change to the prod.yaml and ref this issue. Thanks, Brad!
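For future reference, the usual way to confirm an OOM kill on Kubernetes is to inspect the pod's last terminated state (a sketch against a live cluster; the pod name is taken from the kubectl get po output above, and the exact output layout varies by Kubernetes version):

```sh
# Describe the coordinator pod. For a container killed by the kernel
# OOM killer, the "Last State" section reports Reason: OOMKilled.
kubectl describe pod coordinator-deployment-1785484075-5gcw0
# Expect something like:
#   Last State:   Terminated
#     Reason:     OOMKilled
```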
gopherbot commented on Nov 2, 2017
Change https://golang.org/cl/75532 mentions this issue:
cmd/coordinator: remove memory limits from deployment config
kevinburke commented on Nov 2, 2017
Can we also monitor the amount of memory being used by each builder? GCP has a good metrics aggregator, right?
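Assuming the cluster has a metrics pipeline installed (heapster, or metrics-server on newer clusters), a quick manual check of per-pod memory is (a sketch; for continuous monitoring, GKE's Stackdriver integration can chart the same data over time):

```sh
# Current CPU and memory usage for each pod in the namespace.
kubectl top pod
```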