x/build/cmd/gitmirror: 10 minute timeout on git fetch is incorrectly implemented, so git fetch may still block forever #38887
This time, I was able to gather enough information and figure it out! 🎉

CL 203057 added a 10 minute timeout to the `git fetch` command that gitmirror runs. Also see #21922, #22485, and other similar issues. There's an accepted proposal #23019 to try to change the underlying os/exec behavior.
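For context, the shape of that timeout is roughly the following (a minimal sketch, not the actual CL; the repository path is a placeholder):

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Give up on the fetch after 10 minutes.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// CommandContext kills the git process once ctx expires.
	cmd := exec.CommandContext(ctx, "git", "fetch")
	cmd.Dir = "/path/to/mirror" // placeholder for the mirrored repo
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("git fetch: %v\n%s", err, out)
	}
}
```

The catch, worked out below, is that killing the git process isn't always enough for CombinedOutput to return.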
"attempt 2" meant attempt 1 failed, and this was the corresponding error message:
It seems the Gerrit server was having a temporary issue at that moment. It's likely attempt 2 failed for a similar reason, except it got stuck. The goroutine dump (via the debug endpoint) pointed at line 515 of src/os/exec/exec.go in Go 1.14.2 (commit 96745b9), inside Cmd.Wait.

Note that it's stuck in os/exec's Cmd.Wait rather than in git itself. Until #23019 is resolved, the fix in gitmirror will need to work around this behavior.
It seems like the best thing to do is to read the output of the command with a pipe, rather than CombinedOutput, and to close our end of the pipe after the timeout. Is it possible that sending a SIGINT before CommandContext's SIGKILL could help? It may give git a chance to shut down its subprocesses cleanly. @bcmills introduced a nice function in Playground for a related issue here: https://github.com/golang/playground/blob/master/internal/internal.go#L14-L18
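Something in that spirit, sketched out (my approximation of the idea, not the Playground helper itself; the killDelay value and signal choice are assumptions):

```go
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"
)

// waitOrStop waits for an already-started cmd; if ctx ends first, it
// sends SIGINT and, after killDelay, falls back to SIGKILL. Note that
// if grandchildren still hold the output pipes, a Wait that reads
// those pipes can block regardless; this only stops the top-level
// process politely before resorting to force.
func waitOrStop(ctx context.Context, cmd *exec.Cmd, killDelay time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		_ = cmd.Process.Signal(os.Interrupt) // polite first
		select {
		case err := <-done:
			return err
		case <-time.After(killDelay):
			_ = cmd.Process.Kill() // forceful fallback
			return <-done
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	cmd := exec.Command("git", "fetch")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	if err := waitOrStop(ctx, cmd, 5*time.Second); err != nil {
		log.Printf("git fetch: %v", err)
	}
}
```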
This seems like another case where the behavior of CommandContext is subtler than one might expect. Per https://pkg.go.dev/os/exec?tab=doc#CommandContext:

> The provided context is used to kill the process (by calling os.Process.Kill) if the context becomes done before the command completes on its own.

and https://pkg.go.dev/os?tab=doc#Process.Kill:

> Kill causes the Process to exit immediately. Kill does not wait until the Process has actually exited. This only kills the Process itself, not any other processes it may have started.

That process-only behavior is generally OK for commands that don't start subprocesses of their own. However, with git fetch, the git process we start spawns its own subprocesses; killing only the parent leaves those running, still holding the output pipes, so a read of the command's combined output can block indefinitely.
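A quick way to observe that interaction (my own Unix-only repro sketch, not code from gitmirror):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The shell is killed via ctx after ~2 seconds, but the background
	// sleep it spawned inherits the output pipe and keeps it open, so
	// CombinedOutput keeps reading and blocks for the full 60 seconds.
	cmd := exec.CommandContext(ctx, "sh", "-c", "sleep 60 & wait")
	start := time.Now()
	_, err := cmd.CombinedOutput()
	fmt.Printf("returned after %v: %v\n", time.Since(start).Round(time.Second), err)
}
```

Swap sh for git fetch and sleep for a hung git subprocess, and that's the gitmirror hang.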
I just noticed it happened again, only 6 days since the last time:

Because both instances were stuck on the Go repo, it meant that commits at https://go.googlesource.com/go weren't made available on the https://github.com/golang/go mirror during that time. Those commits were still being tested at https://build.golang.org though. I restarted the two gitmirror instances to fix the problem now:

```
~ $ kubectl get pods | grep gitmirror
gitmirror-rc-7bsh4 1/1 Running 0 6d20h
gitmirror-rc-l5cxp 1/1 Running 0 6d23h
~ $ kubectl delete pod gitmirror-rc-7bsh4 && sleep 60 && kubectl delete pod gitmirror-rc-l5cxp
pod "gitmirror-rc-7bsh4" deleted
pod "gitmirror-rc-l5cxp" deleted
```

It's not hard, but if this continues to happen this frequently without us noticing, automating the fix will be more worthwhile.
Again, today. That's 5 months and 17 days since the last reported occurrence. Fixed using the same procedure as in #38887 (comment). CC @golang/release.
December edition. That's 1 month and 9 days since the last reported occurrence. Fixed using the same procedure as in #38887 (comment).
February 2021 edition: 2 months and 17 days since the last occurrence. Fixed with the same procedure as before.

Doing some xkcd.com/1205 math here: assuming this task is done monthly and takes a couple of minutes, that gives a budget of a couple of hours to automate this (that is, to fix the diagnosed bug). So doing this by hand and automating are quite close in cost.
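(Spelling out that math: a two-minute task done monthly over xkcd's five-year horizon is 2 × 12 × 5 = 120 minutes, i.e. about two hours of automation budget.)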
May 27, 2021 edition. Fixed as before with the same kubectl pod-deletion procedure.
Change https://golang.org/cl/325771 mentions this issue:
And again today.
Change https://golang.org/cl/347294 mentions this issue:
We've seen evidence that we're not terminating Git subprocesses on timeout. This is probably an instance of https://golang.org/issue/23019, which we can deal with by killing all the children rather than just the one we started.

Updates golang/go#38887.

Change-Id: Ie7999122f9c063f04de1a107a57fd307e7a5c1d2
Reviewed-on: https://go-review.googlesource.com/c/build/+/347294
Trust: Heschi Kreinick <[email protected]>
Run-TryBot: Heschi Kreinick <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
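For illustration, "killing all the children" on Unix generally means starting git in its own process group and signaling the whole group. A sketch of that approach (not the CL itself; the helper name is mine):

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"syscall"
	"time"
)

// runWithGroupKill runs cmd and, if ctx expires first, kills cmd's
// entire process group rather than just the process we started.
func runWithGroupKill(ctx context.Context, cmd *exec.Cmd) error {
	// Start the child in a new process group so that its descendants
	// (e.g. the helpers git fetch spawns) share its pgid.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// A negative pid signals every process in the group.
		_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
		<-done
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	if err := runWithGroupKill(ctx, exec.Command("git", "fetch")); err != nil {
		log.Printf("git fetch: %v", err)
	}
}
```

Because the whole group dies, any descendants holding the output pipes die with it, so the pipe readers hit EOF and Wait can return.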
Every now and then, gitmirror gets into a bad state where one of the repositories (usually, the main "go" repo) gets stuck and reports a "repo go: hung? no activity since last success ago" problem on https://farmer.golang.org/#health.

This problem happens rarely (some number of times a year). There are two instances of gitmirror in production, so as long as at least one is in a good state, mirroring continues to operate without a problem. When it happens, it's easy to spot via https://farmer.golang.org/#health and to restart an instance by deleting the problematic pod (a new pod will be spun up automatically by the replication controller).
This is a tracking issue for this problem, to see how often it happens, investigate as needed, and fix it.
Previously:
Relevant CLs:
It happened again today:
I've captured some logs and restarted one of the instances (having two bad instances meant that the Go repo was no longer being mirrored). I'll restart the other one in a bit, after doing some more investigation.
/cc @cagedmantis @toothrot @bcmills