Skip to content

Conversation

dsabsay
Copy link
Contributor

@dsabsay dsabsay commented May 18, 2025

What this PR does:

Previously, TestRecoverAlertsPostOutage had a data race:

==================
WARNING: DATA RACE
Read at 0x00c025bbfc30 by goroutine 56074:
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:569 +0xc8a
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:666 +0x4e5
  github.com/prometheus/prometheus/rules.DefaultEvalIterationFunc()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:81 +0x1a5
  github.com/cortexproject/cortex/pkg/ruler.ruleGroupIterationFunc()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:272 +0x4a9
  github.com/prometheus/prometheus/rules.(*Group).run()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:256 +0x6a7
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:258 +0x11b
  github.com/prometheus/prometheus/rules.(*Manager).Update.gowrap2()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:259 +0x41

Previous write at 0x00c025bbfc30 by goroutine 48:
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1.2()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:580 +0x35e
  runtime.deferreturn()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:605 +0x5d
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:666 +0x4e5
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_test.go:2823 +0x2124
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage_check_races()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_race_test.go:9 +0x30
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x44

Goroutine 56074 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:248 +0x69c
  github.com/cortexproject/cortex/pkg/ruler.(*DefaultMultiTenantManager).syncRulesToManager()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:217 +0xa5c
  github.com/cortexproject/cortex/pkg/ruler.(*DefaultMultiTenantManager).SyncRuleGroups()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:140 +0x1da
  github.com/cortexproject/cortex/pkg/ruler.(*Ruler).syncRules()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler.go:728 +0x71b
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_test.go:2781 +0x130b
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage_check_races()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_race_test.go:9 +0x30
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x44

Goroutine 48 (running) created at:
  testing.(*T).Run()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x8f2
  testing.runTests.func1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2279 +0x85
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.runTests()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2277 +0x96c
  testing.(*M).Run()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2142 +0xeea
  main.main()
      _testmain.go:167 +0x164
==================

As can be seen above, the data race occurs because the rule evaluation is happening from two different goroutines: 1 that is scheduled by the normal ruler manager and 1 that is run by the test itself.

The test itself calls Eval() with specific timestamps to test recovery logic. For purposes of this test, we don't want the normal evaluation loop to run at all.

This commit injects a no-op GroupEvalIterationFunc into the ruler manager so that the only rule evaluation that happens is run by the test.


This test failed on a dependabot PR recently and at least 2 other times on the master branch in the recent past.

I reproduced the race here: https://github.com/cortexproject/cortex/compare/master...dsabsay:cortex:repro-race-test-recover-alerts-post-outage?expand=1
Run with: go test -race -count=1 -run TestRecoverAlertsPostOutage_check_races ./pkg/ruler

Which issue(s) this PR fixes:
Addresses part of #6724

This is maybe not the cleanest fix. What do you think?

@dosubot dosubot bot added the component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. label May 18, 2025
Previously, TestRecoverAlertsPostOutage had a data race:

```
==================
WARNING: DATA RACE
Read at 0x00c025bbfc30 by goroutine 56074:
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:569 +0xc8a
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:666 +0x4e5
  github.com/prometheus/prometheus/rules.DefaultEvalIterationFunc()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:81 +0x1a5
  github.com/cortexproject/cortex/pkg/ruler.ruleGroupIterationFunc()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:272 +0x4a9
  github.com/prometheus/prometheus/rules.(*Group).run()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:256 +0x6a7
  github.com/prometheus/prometheus/rules.(*Manager).Update.func1()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:258 +0x11b
  github.com/prometheus/prometheus/rules.(*Manager).Update.gowrap2()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:259 +0x41

Previous write at 0x00c025bbfc30 by goroutine 48:
  github.com/prometheus/prometheus/rules.(*Group).Eval.func1.2()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:580 +0x35e
  runtime.deferreturn()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:605 +0x5d
  github.com/prometheus/prometheus/rules.(*Group).Eval()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/group.go:666 +0x4e5
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_test.go:2823 +0x2124
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage_check_races()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_race_test.go:9 +0x30
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x44

Goroutine 56074 (running) created at:
  github.com/prometheus/prometheus/rules.(*Manager).Update()
      /Users/danielsabsay/git/cortex/vendor/github.com/prometheus/prometheus/rules/manager.go:248 +0x69c
  github.com/cortexproject/cortex/pkg/ruler.(*DefaultMultiTenantManager).syncRulesToManager()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:217 +0xa5c
  github.com/cortexproject/cortex/pkg/ruler.(*DefaultMultiTenantManager).SyncRuleGroups()
      /Users/danielsabsay/git/cortex/pkg/ruler/manager.go:140 +0x1da
  github.com/cortexproject/cortex/pkg/ruler.(*Ruler).syncRules()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler.go:728 +0x71b
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_test.go:2781 +0x130b
  github.com/cortexproject/cortex/pkg/ruler.TestRecoverAlertsPostOutage_check_races()
      /Users/danielsabsay/git/cortex/pkg/ruler/ruler_race_test.go:9 +0x30
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x44

Goroutine 48 (running) created at:
  testing.(*T).Run()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1851 +0x8f2
  testing.runTests.func1()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2279 +0x85
  testing.tRunner()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:1792 +0x225
  testing.runTests()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2277 +0x96c
  testing.(*M).Run()
      /Users/danielsabsay/go/pkg/mod/golang.org/[email protected]/src/testing/testing.go:2142 +0xeea
  main.main()
      _testmain.go:167 +0x164
==================
```

As can be seen above, the data race occurs because the rule evaluation
is happening from two different goroutines: 1 that is scheduled by the
normal ruler manager and 1 that is run by the test itself.

The test itself calls Eval() with specific timestamps to test recovery
logic. For purposes of this test, we don't want the normal evaluation
loop to run at all.

This commit injects a no-op GroupEvalIterationFunc into the ruler
manager so that the only rule evaluation that happens is run by the
test.

Signed-off-by: dsabsay <[email protected]>
@dsabsay dsabsay force-pushed the fix-race-test-recover-alerts-post-outage branch from e2f8087 to cfb2b16 Compare May 18, 2025 02:48
Signed-off-by: dsabsay <[email protected]>
@yeya24 yeya24 requested a review from rajagopalanand May 19, 2025 02:43
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@SungJin1212
Copy link
Member

lgtm!

Copy link
Contributor

@rajagopalanand rajagopalanand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danielblando danielblando merged commit 8016956 into cortexproject:master May 20, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants