Fix data race in TestRecoverAlertsPostOutage #6750
Merged: danielblando merged 2 commits into cortexproject:master from dsabsay:fix-race-test-recover-alerts-post-outage on May 20, 2025
Previously, TestRecoverAlertsPostOutage had a data race:

```
==================
WARNING: DATA RACE
Read at 0x00c025bbfc30 by goroutine 56074:
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:569 +0xc8a
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:666 +0x4e5
  github.com/prometheus/prometheus/rules.DefaultEvalIterationFunc()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:81 +0x1a5
  github.com/cortexproject/cortex/pkg/ruler.ruleGroupIterationFunc()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:272 +0x4a9
  github.com/prometheus/prometheus/rules.(*Group).run()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:256 +0x6a7
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:258 +0x11b
  github.com/prometheus/prometheus/rules.(*Manager).Update.gowrap2()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:259 +0x41

Previous write at 0x00c025bbfc30 by goroutine 48:
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1.2()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:580 +0x35e
  runtime.deferreturn()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:605 +0x5d
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:666 +0x4e5
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_test.go:2823 +0x2124
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage_check_races()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_race_test.go:9 +0x30
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x44

Goroutine 56074 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:248 +0x69c
  github.com/cortexproject/cortex/pkg/ruler.(*DefaultMultiTenantManager).syncRulesToManager()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:217 +0xa5c
  github.com/cortexproject/cortex/pkg/ruler.(*DefaultMultiTenantManager).SyncRuleGroups()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:140 +0x1da
  github.com/cortexproject/cortex/pkg/ruler.(*Ruler).syncRules()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler.go:728 +0x71b
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_test.go:2781 +0x130b
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage_check_races()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_race_test.go:9 +0x30
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x44

Goroutine 48 (running) created at:
  testing.(*T).Run()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x8f2
  testing.runTests.func1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2279 +0x85
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.runTests()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2277 +0x96c
  testing.(*M).Run()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2142 +0xeea
  main.main()
      _testmain.go:167 +0x164
==================
```

As can be seen above, the data race occurs because the rule evaluation is happening from two different goroutines: 1 that is scheduled by the normal ruler manager and 1 that is run by the test itself.

The test itself calls Eval() with specific timestamps to test recovery logic. For purposes of this test, we don't want the normal evaluation loop to run at all.

This commit injects a no-op GroupEvalIterationFunc into the ruler manager so that the only rule evaluation that happens is run by the test.

Signed-off-by: dsabsay <[email protected]>
e2f8087 to cfb2b16
Signed-off-by: dsabsay <[email protected]>
yeya24
approved these changes
May 19, 2025
Nice catch!
SungJin1212
approved these changes
May 19, 2025
lgtm!
rajagopalanand
approved these changes
May 19, 2025
LGTM
danielblando
approved these changes
May 20, 2025
What this PR does:
Previously, TestRecoverAlertsPostOutage had a data race:
As can be seen above, the data race occurs because the rule evaluation is happening from two different goroutines: 1 that is scheduled by the normal ruler manager and 1 that is run by the test itself.
The test itself calls Eval() with specific timestamps to test recovery logic. For purposes of this test, we don't want the normal evaluation loop to run at all.
This commit injects a no-op GroupEvalIterationFunc into the ruler manager so that the only rule evaluation that happens is run by the test.
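The idea behind the fix can be sketched with a toy model. This is not the Cortex or Prometheus code: `group`, `iterationFunc`, `run`, `defaultIteration`, `noopIteration`, and `evalsWith` are hypothetical names made up for illustration, and the toy `Eval` is guarded by a mutex so the sketch itself is race-free, whereas the trace above shows the real `Eval` is not safe to call from two goroutines at once. The sketch only shows that, with a no-op iteration function injected, the background loop still ticks but never evaluates, so the test's explicit calls are the only evaluations.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// iterationFunc is a hypothetical stand-in for an injectable per-tick hook.
type iterationFunc func(ctx context.Context, g *group, ts time.Time)

// group is a toy rule group that records every evaluation timestamp.
// The mutex keeps this sketch race-free; it is not part of the real fix.
type group struct {
	mu    sync.Mutex
	evals []time.Time
}

// Eval records one evaluation at the given timestamp.
func (g *group) Eval(ctx context.Context, ts time.Time) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.evals = append(g.evals, ts)
}

// evalCount returns how many evaluations have been recorded.
func (g *group) evalCount() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.evals)
}

// defaultIteration evaluates the group on every tick.
func defaultIteration(ctx context.Context, g *group, ts time.Time) {
	g.Eval(ctx, ts)
}

// noopIteration does nothing, so the background loop never evaluates.
func noopIteration(context.Context, *group, time.Time) {}

// run is the background loop: it calls iter on every tick until ctx is done.
func run(ctx context.Context, g *group, interval time.Duration, iter iterationFunc) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case ts := <-ticker.C:
			iter(ctx, g, ts)
		}
	}
}

// evalsWith starts the background loop with iter, performs two explicit
// evaluations with chosen timestamps (as a test would), stops the loop,
// and returns the total number of evaluations recorded.
func evalsWith(iter iterationFunc) int {
	g := &group{}
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		run(ctx, g, time.Millisecond, iter)
	}()

	base := time.Unix(0, 0)
	g.Eval(ctx, base)
	time.Sleep(20 * time.Millisecond) // let the background loop tick
	g.Eval(ctx, base.Add(time.Minute))

	cancel()
	wg.Wait()
	return g.evalCount()
}

func main() {
	fmt.Println("no-op iteration:", evalsWith(noopIteration))
	fmt.Println("default iteration:", evalsWith(defaultIteration))
}
```

With `noopIteration` the count is exactly the two explicit calls. With `defaultIteration` the count also includes however many ticks fired, which depends on timing; that nondeterministic interleaving is what the test needs to avoid.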
This test failed on a dependabot PR recently and at least 2 other times on the master branch in the recent past.
I reproduced the race here: https://github.com/cortexproject/cortex/compare/master...dsabsay:cortex:repro-race-test-recover-alerts-post-outage?expand=1
Run with:
go test -race -count=1 -run TestRecoverAlertsPostOutage_check_races ./pkg/ruler
Which issue(s) this PR fixes:
Addresses part of #6724
This is maybe not the cleanest fix. What do you think?