Description
What version of Go are you using (go version)?
Tracked the specific issue to commit b4b0144 via git bisect
$ go version
go version devel go1.17-d568e6e075 Tue Jul 20 19:54:36 2021 +0000 linux/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
andrewvc@LAPTOP-80O11FM2 ~/p/b/heartbeat (fix-timer-failure)> go env
warning: GOPATH set to GOROOT (/home/andrewvc/projects/go) has no effect
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/andrewvc/.cache/go-build"
GOENV="/home/andrewvc/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/andrewvc/projects/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/andrewvc/projects/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/andrewvc/projects/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/andrewvc/projects/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/andrewvc/projects/beats/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build674387485=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Longstanding unit tests (~1.5 years old) started sporadically failing against Go 1.16.x.
Our program (Heartbeat) rapidly stops and resets a single timer based on user-submitted jobs, effectively stress-testing the Go timer. Further investigation revealed that timer.Reset was no longer resetting the timer consistently. Roughly once every 10-40k iterations it had no effect, resulting in a timer that never fired and, in our case, a deadlocked program.
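For readers who want a feel for the access pattern described above, here is a minimal sketch of a stop/reset loop guarded by a watchdog. It is not the Heartbeat code: the names resetAndWait and stress, the 1µs delay, and the 5s watchdog are all assumptions chosen for illustration, and this single-goroutine loop is not claimed to reproduce the failure on an affected Go version.

```go
package main

import (
	"fmt"
	"time"
)

// resetAndWait (hypothetical helper) stops t, drains a stale tick if one is
// pending, resets t to fire after d, and reports whether it fired before the
// watchdog deadline expired.
func resetAndWait(t *time.Timer, d, watchdog time.Duration) bool {
	if !t.Stop() {
		// The timer had already expired or been stopped; drain without blocking.
		select {
		case <-t.C:
		default:
		}
	}
	t.Reset(d)

	// The watchdog is a fresh timer, so it does not depend on Reset.
	wd := time.NewTimer(watchdog)
	defer wd.Stop()
	select {
	case <-t.C:
		return true
	case <-wd.C:
		return false
	}
}

// stress (hypothetical helper) returns the index of the first iteration on
// which the reset timer failed to fire, or -1 if every iteration fired.
func stress(iterations int) int {
	t := time.NewTimer(time.Hour)
	defer t.Stop()
	for i := 0; i < iterations; i++ {
		if !resetAndWait(t, time.Microsecond, 5*time.Second) {
			return i
		}
	}
	return -1
}

func main() {
	if i := stress(1000); i >= 0 {
		fmt.Printf("timer did not fire on iteration %d\n", i)
		return
	}
	fmt.Println("all timers fired")
}
```

The watchdog turns a silent hang into a reportable failure, which is the same idea the enhanced test suite uses.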
We tracked the specific issue to a change introduced in Go commit b4b0144 via git bisect.
The failure can be reproduced by running from the special branch below, which contains an enhanced test suite for Heartbeat using a watchdog timer to catch the failed timer.
# Examples use a zip download to prevent a full repo clone
curl -L https://github.com/andrewvc/beats/archive/refs/heads/broken-timer.zip -o broken-timer.zip
unzip -q broken-timer.zip
cd beats-broken-timer/heartbeat
go test -timeout 30s -run '^TestStress$' github.com/elastic/beats/v7/heartbeat/scheduler/timerqueue
We are now avoiding Reset in favor of NewTimer in a workaround PR, elastic/beats#27006. You can validate this by deleting the Reset call here and replacing it with the NewTimer call here.
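As a rough illustration of the workaround's shape (not the actual change in elastic/beats#27006), the sketch below stops the old timer and allocates a new one instead of calling Reset. The helper names rearm and fireCount are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// rearm (hypothetical helper) stops the previous timer, if any, and returns a
// brand-new timer that fires after d. It never calls Reset.
func rearm(old *time.Timer, d time.Duration) *time.Timer {
	if old != nil {
		old.Stop()
	}
	return time.NewTimer(d)
}

// fireCount (hypothetical helper) re-arms a timer n times and counts how many
// times it fired. Each receive blocks until the new timer fires.
func fireCount(n int) int {
	var t *time.Timer
	fired := 0
	for i := 0; i < n; i++ {
		t = rearm(t, time.Microsecond)
		<-t.C
		fired++
	}
	return fired
}

func main() {
	fmt.Println("fired:", fireCount(1000))
}
```

The trade-off is one extra allocation per re-arm in exchange for not depending on Reset.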
Digging into the Go source code, I discovered that I could fix the issue by commenting out the optimization on this line inside adjusttimers. It seems that the accounting of that variable may have an issue somewhere. The code is quite tricky and heavily concurrent, and could use the eye of someone familiar with it.
What did you expect to see?
I expected the timer to fire consistently when reset.
What did you see instead?
Nothing. After enhancing the test suite for Heartbeat to dump traces, it was apparent that the program was in an idle state, with no timer scheduled and no other code blocked.
Activity
Title changed from "Timer reset broken under heavy use since go1.16 timer optimizations added" to "time: Timer reset broken under heavy use since go1.16 timer optimizations added"
ianlancetaylor commented on Jul 22, 2021
Thanks for the good test case and analysis.
ianlancetaylor commented on Jul 22, 2021
@gopherbot Please open backport to 1.16.
This bug also exists in 1.16. It can cause programs that use Timer.Reset to fail to run a timer when it is ready.
gopherbot commented on Jul 22, 2021
Backport issue(s) opened: #47332 (for 1.16).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
gopherbot commented on Jul 22, 2021
Change https://golang.org/cl/336432 mentions this issue:
runtime: don't clear timerModifiedEarliest if adjustTimers is 0
Fix timer failure (#27006)
Fix timer failure (#27006) (#27017)