
x/build: coordinator restarts #22042

Closed

Description

@s-mang

The coordinator sometimes restarts unexpectedly, causing build failures and terminating gomote sessions.

Trends:

  • I have only ever observed this under load (many builders building at once, such that we are at our maximum threshold of 500 GCP CPUs).
  • When it does restart, it happens in bursts (many restarts over a span of hours).
  • There is a correlation between coordinator restarts and Go releases (use of golang.org/x/build/cmd/release), though that could simply be because the release tool generates load.
  • There is no indication in the logs that the coordinator restarted due to a program error: no panics, nor any other sign of an error on the coordinator side, just start-up logs. E.g.:
2017/10/25 18:59:36 Creating Kubernetes pod "buildlet-linux-kubestd-rn68e80e3" for host-linux-kubestd
2017/10/25 18:59:36 Registering reverse buildlet "go-builder-2" (10.240.0.20:46624) for host type host-linux-arm5spacemonkey
2017/10/25 18:59:36 Buildlet go-builder-2/10.240.0.20:46624: {Version:15} for [host-linux-arm5spacemonkey]
# restart here
2017/10/25 18:59:38 coordinator version "d55be8da83e622767c00a19031b29cb701fcc744" starting

The rest of this issue documents specific instances in which I observed these restarts.

1. 08/11/2017

Maintner went down and many builders queued up (see #21383).
When we got maintner back up, all the queued builders tried to run, and many (most?) failed.
It seemed to me that the failures had to be caused by all the builders sharing some resource and hitting some limit.
@aclements helped me debug a bit; we concluded that we were probably hitting a disk write IOPS limit.
I cannot remember whether we observed coordinator restarts then.

2. 09/22/2017

The second instance of this that I've seen was a set of @rsc's golang.org/x/build/cmd/release failures (the binary is named build.release below).

$ ./build.release boring18
...
2017/09/22 16:45:28 linux-amd64: Start.
2017/09/22 16:45:28 linux-amd64: Creating buildlet.
2017/09/22 16:47:13 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: EOF
$ ./build.release boring18
2017/09/22 16:49:32 linux-amd64: Start.
2017/09/22 16:49:32 linux-amd64: Creating buildlet.
2017/09/22 16:49:47 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: dial tcp 107.178.219.46:443: getsockopt: connection timed out
ls: cannot access go1.8.3b4.linux-amd64.tar.gz: No such file or directory
$ ./build.release boring18
2017/09/22 16:52:48 linux-amd64: Start.
2017/09/22 16:52:48 linux-amd64: Creating buildlet.
2017/09/22 16:55:05 linux-amd64: Pushing source to buildlet.
2017/09/22 16:55:22 linux-amd64: Writing VERSION file.
2017/09/22 16:55:22 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 16:55:22 linux-amd64: Building.
2017/09/22 16:59:24 linux-amd64: Error: error copying response: unexpected EOF
$ ./build.release boring18
2017/09/22 17:02:11 linux-amd64: Start.
2017/09/22 17:02:11 linux-amd64: Creating buildlet.
2017/09/22 17:04:24 linux-amd64: Pushing source to buildlet.
2017/09/22 17:04:40 linux-amd64: Writing VERSION file.
2017/09/22 17:04:40 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 17:04:40 linux-amd64: Building.
2017/09/22 17:11:32 linux-amd64: Error: error copying response: unexpected EOF

The last three failures were due to the coordinator restarting in the middle of handling the requests.
As above, there is no indication that the coordinator restarted due to a program error: no panics, nor any other sign of an error on the coordinator side.

Buildlet logs from the same period were also interspersed with builder failures that did not point to a program error (just exit status 1), e.g.:

13:45:31.000
2017/09/22 20:45:31 {windows-amd64-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab } failed: tests failed: dist test failed: go_test:os: exit status 1

13:45:51.000
2017/09/22 20:45:51 [build windows-386-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab]: Execution error running go_test:path/filepath on http://10.240.0.71 GCE VM: buildlet-windows-amd64-2008-rnbeb0f3b: buildlet: Client closed (numFails = 1)

13:46:22.000
2017/09/22 20:46:22 failed to extract snapshot for helper 10.0.3.52:80: Post http://10.0.3.52/writetgz?dir=go: EOF

13:46:33.000
2017/09/22 20:46:33 {linux-amd64-alpine 93c869ec1678e2cbe3baf4c2e48f7c3482012ae2 } failed: tests failed: dist test failed: go_test:runtime: exit status 1

13:46:37.000
2017/09/22 20:46:37 [build linux-amd64-alpine 39d24b6b6d3e25b4e3a5ad6c7b79891fc3020319]: Execution error running runtime:cpu124 on http://10.0.5.111 Kube Pod: buildlet-linux-x86-alpine-rn34e9c04: buildlet: Client closed (numFails = 1)

13:46:47.000
2017/09/22 20:46:47 {plan9-amd64-9front 42cb4dcdb59a0be3ad52bf6cf33489301d039e08 } failed: tests failed: dist test failed: go_test:net: exit status: 'go 29782: 1'

@rsc tried the release again a few days later and everything worked fine; there had been no changes to the builder pipeline (that I am aware of).

3. 10/25/2017

Restart timestamps (Eastern): 15:06, 14:59, 13:56

shadams$ kubectl get po
NAME                                                  READY     STATUS    RESTARTS   AGE
coordinator-deployment-1785484075-5gcw0               1/1       Running   3          1h

cc @bradfitz @andybons @broady

Activity

added this to the Unreleased milestone on Sep 26, 2017
added the Builders label (x/build issues: builders, bots, dashboards) on Sep 26, 2017
@broady (Contributor) commented on Sep 26, 2017

More long-term, we should come up with some plan to prevent these failures (backpressure) and also provide some priority queue (e.g., give priority to release builders).

For cmd/release, I did have a (mental) todo to add support for retries, but it happens infrequently enough that I just retry manually.

@bradfitz (Contributor) commented on Sep 26, 2017

> More long-term, we should come up with some plan to prevent these failures (backpressure) and also provide some priority queue (e.g., give priority to release builders).

That is #19178.

changed the title from "x/build: failures under load" to "x/build: coordinator restarts under load" on Oct 25, 2017
changed the title from "x/build: coordinator restarts under load" to "x/build: coordinator restarts" on Oct 25, 2017
@s-mang (Contributor, Author) commented on Nov 2, 2017

Thanks to @kelseyhightower for helping me debug this.
It seems the coordinator has been restarting because it was running out of memory.

The coordinator had a resource limit of 2Gi of memory in its deployment config.
We removed the limit in-place (via kubectl edit deployment coordinator-deployment).
I will submit a CL to remove the limit from the deployment config source (once I find it).
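
For reference, here is a minimal sketch of what such a limit looks like in a Kubernetes Deployment spec. The resources/limits fields are standard Kubernetes; the container name and image below are placeholders, not copied from the real deployment-prod.yaml:

spec:
  template:
    spec:
      containers:
      - name: coordinator                  # placeholder container name
        image: gcr.io/PROJECT/coordinator  # placeholder image
        resources:
          limits:
            memory: "2Gi"                  # the container is OOM-killed once it exceeds this

Removing the limits.memory entry (which is what the kubectl edit above did) means the container is no longer killed at 2Gi, but it can then grow until the node itself runs low on memory, so keeping an eye on actual usage still matters.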

@bradfitz (Contributor) commented on Nov 2, 2017

Sarah, can you or Kelsey note here how this was debugged, for future reference?

The limit is at https://github.com/golang/build/blob/master/cmd/coordinator/deployment-prod.yaml#L26

@s-mang (Contributor, Author) commented on Nov 2, 2017

Yes, totally. Good idea.

kubectl get po coordinator-deployment-$HASH -o yaml

then, under the status section, the reason for the last restart was OOM.
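
Concretely, the tell is in the pod's containerStatuses. A rough sketch of the relevant part of that output, using the standard pod status fields (values here are illustrative apart from the restart count shown by kubectl get po above):

status:
  containerStatuses:
  - name: coordinator
    restartCount: 3
    lastState:
      terminated:
        reason: OOMKilled        # the reason for the last restart
        exitCode: 137            # 128 + SIGKILL(9), typical of an OOM kill
        startedAt: ...           # timestamps elided
        finishedAt: ...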

I will submit the change to the prod.yaml and reference this issue. Thanks, Brad!

@gopherbot (Contributor) commented on Nov 2, 2017

Change https://golang.org/cl/75532 mentions this issue: cmd/coordinator: remove memory limits from deployment config

@kevinburke (Contributor) commented on Nov 2, 2017

Can we also monitor the amount of memory being used by each builder? GCP has a good metrics aggregator, right?
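
For an ad-hoc spot check (assuming the cluster has a metrics source such as Heapster running, which GKE provided by default at the time), kubectl top reports current per-pod usage:

kubectl top pod coordinator-deployment-$HASH

For longer-term trending and alerting, Stackdriver Monitoring (presumably the GCP metrics aggregator referred to here) could chart container memory over time.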

locked and limited conversation to collaborators on Nov 2, 2018