Closed
Description
2021-06-14T07:12:37-326ea43/illumos-amd64
2021-06-13T08:17:17-24cff0f/darwin-amd64-11_0
2021-05-25T23:41:42-74242ba/illumos-amd64
2021-05-04T00:03:39-496d7c6/linux-ppc64le-buildlet
2021-05-03T16:25:05-d75fbac/linux-ppc64-buildlet
2021-04-30T20:00:36-8e91458/linux-ppc64-buildlet
2021-04-30T19:41:02-0bbfc5c/illumos-amd64
See previously #45773.
Activity
bcmills commented on Jun 14, 2021
Marking as release-blocker for Go 1.17 (CC @golang/release) because this test is new as of 1.17 — we shouldn't be shipping new tests that are known to be flaky.
If we can confirm that the test failures are not due to an actual regression in Go 1.17, we can add a call to testenv.SkipFlaky to unblock the release while we figure out how to make the test more robust.
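For reference, such a skip would look roughly like the sketch below. testenv.SkipFlaky lives in the internal/testenv package, so it is only usable by tests inside the Go tree; the package clause and the issue number here are placeholders, not taken from the actual change.

```go
package signal_test

import (
	"internal/testenv"
	"testing"
)

func TestSignalTrace(t *testing.T) {
	// Hypothetical guard: mark the test as known-flaky so it is skipped by
	// default; 12345 is a placeholder for the tracking issue number.
	testenv.SkipFlaky(t, 12345)

	// ... the existing body of the test would follow here ...
}
```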
AndrewGMorgan commented on Jun 14, 2021
The #45773 issue referred exclusively to linux failures. The code I was testing when I added this test case (for #44193) was linux-specific, so those failures seemed relevant; I just hadn't seen #45773 until being cc'd on this one.
I recall being concerned, while developing the fix for #44193, that my change wasn't breaking the thing this test exercises. It didn't occur to me that the pre-existing code might not be working. Further, since this present bug shows up on non-linux platforms (darwin and illumos), I'm fairly sure it must have a different root cause.
Have we seen this issue on the 1.16 branch? Or do we believe this is a 1.17 regression?
bcmills commented on Jun 15, 2021
TestSignalTrace itself is new in 1.17 (added in CL 305149), so the test flakiness is by definition a regression.
The new test verifies the fix for the bug found in #44193, which was similar to #43149, which was present in 1.16.
So my best guess is that the underlying cause was either similar or even worse in 1.16, and the test failures indicate either an incomplete fix or a bug in the test itself.
AndrewGMorgan commented on Jun 15, 2021
I'm a bit confused, so here are some details that seem relevant to me. Please correct any of them, or add some more data points if known:
- CL 316869 brought TestSignalTrace() into the 1.16 branch.
- TestSignalTrace() does not test anything directly related to the bug found in #44193 (os/signal: timeout in TestAllThreadsSyscallSignals).
- TestSignalTrace() tests for regressions in some of the pre-existing runtime dependencies of the feature fixed for that bug. Namely, this test validates that runtime.stopTheWorldGC() does not interfere with signal tracing.
So, it feels like an important data point to seek is: do we have any crash logs like this from 1.16 (after CL 316869)? If not, are we convinced that this present bug isn't purely a 1.17 regression in the code paths tested by TestSignalTrace()?
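To make that interaction concrete, here is a self-contained approximation of the kind of check being described; this is not the actual TestSignalTrace from os/signal, and the test name, iteration count, and timeout are made up. The idea is to repeatedly start and stop the execution tracer (which stops the world) while confirming that a signal the process sends to itself still gets delivered.

```go
package sigtrace

import (
	"io"
	"os"
	"os/signal"
	"runtime/trace"
	"syscall"
	"testing"
	"time"
)

// TestSignalDuringTrace approximates the scenario under discussion: signal
// delivery must keep working while trace.Start/trace.Stop (which stop the
// world) are called in a loop. Unix-only, since it uses syscall.Kill.
func TestSignalDuringTrace(t *testing.T) {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGHUP)
	defer signal.Stop(c)

	for i := 0; i < 100; i++ {
		if err := trace.Start(io.Discard); err != nil {
			t.Fatalf("trace.Start: %v", err)
		}
		// Send ourselves a signal while tracing is active.
		if err := syscall.Kill(syscall.Getpid(), syscall.SIGHUP); err != nil {
			trace.Stop()
			t.Fatalf("kill: %v", err)
		}
		select {
		case <-c:
			// Delivered despite the stop-the-world transitions.
		case <-time.After(30 * time.Second):
			trace.Stop()
			t.Fatal("timed out waiting for SIGHUP")
		}
		trace.Stop()
	}
}
```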
bcmills commented on Jun 16, 2021
I don't see any failures on the dashboard for the 1.16 branch, but given how few test runs occur on that branch that doesn't tell us much. (The rate of failures at head is high enough to rule out a hardware flake, but the failures are still relatively infrequent overall.)
toothrot commented on Jun 17, 2021
/cc @aclements @randall77 @mknyszek @prattmic
AndrewGMorgan commented on Jun 17, 2021
The build failure logs for linux-ppc* all seem to be from the timeline just before this CL: https://go-review.googlesource.com/c/go/+/315049/ (submitted May 4).
Is it reasonable to discount the linux-ppc* examples from the list at the top of this present bug? Or were those after that CL was applied? If so, it looks like illumos and darwin are the architectures that remain occasionally hiccuping with a 100ms timeout. Is there some reason to expect signal delivery to take that long on them? Or trace stop/start to be significantly slower on these architectures?
bcmills commented on Jun 17, 2021
Yes, I think it's reasonable to focus on the illumos and darwin failures.
I don't know why darwin would be particularly slow. The illumos builder is a reverse builder run by @jclulow, and I think it's a bit more heavily loaded than many of the other builders.
(But, really, tests in the standard library should be written so that they don't assume fast hardware, because Go users in general should be able to run go test all on their code, and many Go users do have slow or older machines.)
AndrewGMorgan commented on Jun 18, 2021
To be quite honest, when I added this test, I was just reusing the pre-existing waitSig() function, assuming it must be the standard way to do all this. Tuning whatever timeout never crossed my mind.
Given these occasional timeouts, I'd be tempted to replace all the complex code in signal_test.go:init() with simply setting the timeout to 30 * time.Second. Given that the timeout is fatal if it ever fires, and we don't expect it to ever fire for cause, this seems like a pragmatic way to avoid false positives on all architectures. If I read the workaround code correctly, it is saying that the OS, load, etc. make the timing unpredictable.
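To ground the discussion, here is a simplified stand-in for the pattern being described. The real waitSig() in signal_test.go also retries while waiting, and the init() mentioned above adjusts the timeout for some environments, so treat this as an approximation rather than the actual code.

```go
package sigwait

import (
	"os"
	"testing"
	"time"
)

// settleTime mirrors the short default deadline discussed above; on a
// heavily loaded builder, 100ms may not be enough for the kernel to
// deliver a signal and for the runtime to hand it to the channel.
const settleTime = 100 * time.Millisecond

// waitSig is a simplified stand-in for the helper in signal_test.go:
// missing the deadline is treated as a fatal test failure.
func waitSig(t *testing.T, c <-chan os.Signal, want os.Signal) {
	t.Helper()
	select {
	case got := <-c:
		if got != want {
			t.Fatalf("received signal %v, want %v", got, want)
		}
	case <-time.After(settleTime):
		t.Fatalf("timed out after %v waiting for %v", settleTime, want)
	}
}
```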
ianlancetaylor commented on Jun 18, 2021
We currently use settleTime for two different things: for waitSig and friends, and for quiesce. We don't want to increase the time used by quiesce arbitrarily, because that will slow down the tests. But I agree that there doesn't seem to be a reason to use a short timeout for waitSig.
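The split Ian describes could look roughly like the sketch below; this is a hand-written approximation of the idea, not the CL referenced next, and the constant names are chosen here for illustration. quiesce keeps the short settleTime so the tests stay fast, while waitSig and friends only give up after a much longer fatal deadline.

```go
package sigwait

import (
	"os"
	"testing"
	"time"
)

const (
	// settleTime stays short: quiesce only needs to wait long enough for
	// stray signals to drain, and raising it would slow every test down.
	settleTime = 100 * time.Millisecond

	// fatalWaitingTime is used only where a missed signal fails the test;
	// it should never fire in a healthy run, so it can be very generous.
	fatalWaitingTime = 30 * time.Second
)

// quiesce drains any stray, already-pending signals for settleTime.
func quiesce(c <-chan os.Signal) {
	deadline := time.After(settleTime)
	for {
		select {
		case <-c:
		case <-deadline:
			return
		}
	}
}

// waitSig waits up to fatalWaitingTime for the expected signal before
// declaring the test a failure.
func waitSig(t *testing.T, c <-chan os.Signal, want os.Signal) {
	t.Helper()
	select {
	case got := <-c:
		if got != want {
			t.Fatalf("received signal %v, want %v", got, want)
		}
	case <-time.After(fatalWaitingTime):
		t.Fatalf("timed out after %v waiting for %v", fatalWaitingTime, want)
	}
}
```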
gopherbot commented on Jun 19, 2021
Change https://golang.org/cl/329502 mentions this issue:
os/signal: test with a significantly longer fatal timeout