x/build/cmd/gitmirror: 10 minute timeout on git fetch is incorrectly implemented, so git fetch may still block forever #38887
This time, I was able to gather enough information and figure it out! 🎉

CL 203057 added a 10 minute timeout to the `git fetch` command that gitmirror runs. Also see #21922, #22485, and other similar issues. There's an accepted proposal #23019 to try to change the underlying os/exec behavior.
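For context, the shape of that timeout is roughly the following (a minimal sketch, not the actual CL; the repository path is a placeholder):

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Give up on the fetch after 10 minutes.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// CommandContext kills the git process once ctx expires.
	cmd := exec.CommandContext(ctx, "git", "fetch")
	cmd.Dir = "/path/to/mirror" // placeholder for the mirrored repo
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("git fetch: %v\n%s", err, out)
	}
}
```

The catch, worked out below, is that killing the git process isn't always enough for CombinedOutput to return.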
"attempt 2" meant attempt 1 failed, and this was the corresponding error message:
It seems the Gerrit server was having a temporary issue at that moment. It's likely attempt 2 failed for a similar reason, except it got stuck. The goroutine dump (via the debug endpoint) pointed at line 515 of src/os/exec/exec.go in Go 1.14.2 (commit 96745b9), inside Cmd.Wait.

Note that it's stuck in os/exec's Cmd.Wait rather than in git itself. Until #23019 is resolved, the fix in gitmirror will need to work around this behavior.
It seems like the best thing to do is to read the output of the command with a pipe, rather than CombinedOutput, and to close our end of the pipe after the timeout. Is it possible that sending a SIGINT before CommandContext's SIGKILL could help? It may give git a chance to shut down its subprocesses cleanly. @bcmills introduced a nice function in Playground for a related issue here: https://github.com/golang/playground/blob/master/internal/internal.go#L14-L18
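Something in that spirit, sketched out (my approximation of the idea, not the Playground helper itself; the killDelay value and signal choice are assumptions):

```go
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"
)

// waitOrStop waits for an already-started cmd; if ctx ends first, it
// sends SIGINT and, after killDelay, falls back to SIGKILL. Note that
// if grandchildren still hold the output pipes, a Wait that reads
// those pipes can block regardless; this only stops the top-level
// process politely before resorting to force.
func waitOrStop(ctx context.Context, cmd *exec.Cmd, killDelay time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		_ = cmd.Process.Signal(os.Interrupt) // polite first
		select {
		case err := <-done:
			return err
		case <-time.After(killDelay):
			_ = cmd.Process.Kill() // forceful fallback
			return <-done
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	cmd := exec.Command("git", "fetch")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	if err := waitOrStop(ctx, cmd, 5*time.Second); err != nil {
		log.Printf("git fetch: %v", err)
	}
}
```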
This seems like another case where the behavior of CommandContext is subtler than one might expect. Per https://pkg.go.dev/os/exec?tab=doc#CommandContext:

> The provided context is used to kill the process (by calling os.Process.Kill) if the context becomes done before the command completes on its own.

and https://pkg.go.dev/os?tab=doc#Process.Kill:

> Kill causes the Process to exit immediately. Kill does not wait until the Process has actually exited. This only kills the Process itself, not any other processes it may have started.

That process-only behavior is generally OK for commands that don't start subprocesses of their own. However, with git fetch, the git process we start spawns its own subprocesses; killing only the parent leaves those running, still holding the output pipes, so a read of the command's combined output can block indefinitely.
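A quick way to observe that interaction (my own Unix-only repro sketch, not code from gitmirror):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The shell is killed via ctx after ~2 seconds, but the background
	// sleep it spawned inherits the output pipe and keeps it open, so
	// CombinedOutput keeps reading and blocks for the full 60 seconds.
	cmd := exec.CommandContext(ctx, "sh", "-c", "sleep 60 & wait")
	start := time.Now()
	_, err := cmd.CombinedOutput()
	fmt.Printf("returned after %v: %v\n", time.Since(start).Round(time.Second), err)
}
```

Swap sh for git fetch and sleep for a hung git subprocess, and that's the gitmirror hang.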
I just noticed it happened again, only 6 days since the last time:

Because both instances were stuck on the Go repo, it meant that commits at https://go.googlesource.com/go weren't made available on the https://github.com/golang/go mirror during that time. Those commits were still being tested at https://build.golang.org though. I restarted the two gitmirror instances to fix the problem now:

```
~ $ kubectl get pods | grep gitmirror
gitmirror-rc-7bsh4 1/1 Running 0 6d20h
gitmirror-rc-l5cxp 1/1 Running 0 6d23h
~ $ kubectl delete pod gitmirror-rc-7bsh4 && sleep 60 && kubectl delete pod gitmirror-rc-l5cxp
pod "gitmirror-rc-7bsh4" deleted
pod "gitmirror-rc-l5cxp" deleted
```

It's not hard, but if this continues to happen this frequently without us noticing, automating the fix will be more worthwhile.
Again, today. That's 5 months and 17 days since the last reported occurrence. Fixed using the same procedure as in #38887 (comment). CC @golang/release.
December edition. That's 1 month and 9 days since the last reported occurrence. Fixed using the same procedure as in #38887 (comment).
February 2021 edition: 2 months and 17 days since the last occurrence. Fixed with the same procedure as before.

Doing some xkcd.com/1205 math here: assuming this task is done monthly and takes a couple of minutes, that gives a budget of a couple of hours to automate this (that is, to fix the diagnosed bug). So doing this by hand and automating are quite close in cost.
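(Spelling out that math: a two-minute task done monthly over xkcd's five-year horizon is 2 × 12 × 5 = 120 minutes, i.e. about two hours of automation budget.)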
May 27, 2021 edition. Fixed as before with the same kubectl pod-deletion procedure.
Change https://golang.org/cl/325771 mentions this issue:
And again today.
Change https://golang.org/cl/347294 mentions this issue:
We've seen evidence that we're not terminating Git subprocesses on timeout. This is probably an instance of https://golang.org/issue/23019, which we can deal with by killing all the children rather than just the one we started.

Updates golang/go#38887.

Change-Id: Ie7999122f9c063f04de1a107a57fd307e7a5c1d2
Reviewed-on: https://go-review.googlesource.com/c/build/+/347294
Trust: Heschi Kreinick <[email protected]>
Run-TryBot: Heschi Kreinick <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
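For illustration, "killing all the children" on Unix generally means starting git in its own process group and signaling the whole group. A sketch of that approach (not the CL itself; the helper name is mine):

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"syscall"
	"time"
)

// runWithGroupKill runs cmd and, if ctx expires first, kills cmd's
// entire process group rather than just the process we started.
func runWithGroupKill(ctx context.Context, cmd *exec.Cmd) error {
	// Start the child in a new process group so that its descendants
	// (e.g. the helpers git fetch spawns) share its pgid.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// A negative pid signals every process in the group.
		_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
		<-done
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	if err := runWithGroupKill(ctx, exec.Command("git", "fetch")); err != nil {
		log.Printf("git fetch: %v", err)
	}
}
```

Because the whole group dies, any descendants holding the output pipes die with it, so the pipe readers hit EOF and Wait can return.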
Every now and then, gitmirror gets into a bad state where one of the repositories (usually, the main "go" repo) gets stuck and reports a "repo go: hung? no activity since last success ago" problem on https://farmer.golang.org/#health.

This problem happens rarely (some number of times a year). There are two instances of gitmirror in production, so as long as at least one is in a good state, mirroring continues to operate without a problem. When it happens, it's easy to spot via https://farmer.golang.org/#health and to restart an instance by deleting the problematic pod (a new pod will be spun up automatically by the replication controller).
This is a tracking issue for this problem, to see how often it happens, investigate as needed, and fix it.
Previously:
Relevant CLs:
It happened again today:
I've captured some logs and restarted one of the instances (having two bad instances meant that the Go repo was no longer being mirrored). I'll restart the other one in a bit, after doing some more investigation.
/cc @cagedmantis @toothrot @bcmills