Fix data race in user list of a queue #6160


Merged: 3 commits, Aug 13, 2024

Conversation

@justinjung04 (Contributor) commented Aug 13, 2024

What this PR does:

Adds a mutex on the user list of a queue. The race rarely happens, but it was the root cause of the flaky test failing with the message below:

==================
WARNING: DATA RACE
Write at 0x00c000590000 by goroutine 52:
  github.com/cortexproject/cortex/pkg/scheduler/queue.(*queues).deleteQueue()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues.go:114 +0x1c4
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.func1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:486 +0x12a
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.gowrap1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:488 +0x41

Previous read at 0x00c000590000 by goroutine 51:
  github.com/cortexproject/cortex/pkg/scheduler/queue.(*queues).getNextQueueForQuerier()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues.go:234 +0x129
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.func1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:482 +0x1a4
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.gowrap1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:488 +0x41

Goroutine 52 (running) created at:
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:477 +0x324
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44

Goroutine 51 (running) created at:
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:477 +0x324
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44
==================

It seems we have always had this issue, but it only surfaced after a race-condition test was added in the previous PR.
Unfortunately I wasn't able to create a test case that reproduces this consistently.
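For context, a concurrency test of the shape the race detector flags can be sketched like this (a hypothetical reproduction with simplified stand-in types, not the PR's actual test; with the mutex lines removed, running under `go test -race` or `go run -race` would report a data race on the user list):

```go
package main

import "sync"

// userList is a simplified stand-in for the queue's user list.
type userList struct {
	mx    sync.Mutex
	users []string
}

func (l *userList) add(u string) {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.users = append(l.users, u)
}

func (l *userList) deleteAll() {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.users = l.users[:0]
}

func (l *userList) next(i int) string {
	l.mx.Lock()
	defer l.mx.Unlock()
	if len(l.users) == 0 {
		return ""
	}
	return l.users[i%len(l.users)]
}

func main() {
	l := &userList{}
	var wg sync.WaitGroup
	// Hammer the list from several goroutines, mixing writes and reads,
	// so the race detector has a chance to observe conflicting accesses.
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				l.add("user")
				l.next(j)
				l.deleteAll()
			}
		}()
	}
	wg.Wait() // with the mutex removed, -race reports a data race here
}
```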

Which issue(s) this PR fixes:
Fixes #6109

Checklist

  • [n/a] Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Justin Jung <[email protected]>
@danielblando (Contributor)

I was looking at the code: we already have a mutex, userQueuesMx, and it looks like it is always locked right before the new mutex usersMx. Does it make sense to create a new mutex, or will userQueues and users need to be locked together? It seems we always track both together, so it may be simpler to use a single mutex for both.

From the failing tests it seems that getNextQueueForQuerier is usually involved in the issue, and it is the only function that locks userQueuesMx after checking users. Isn't it simpler to take the lock earlier and have only one lock?

deleteQueue already uses userQueuesMx to also guard users:

q.userQueuesMx.Lock()

getOrAddQueue does the same:

q.userQueuesMx.Lock()

getNextQueueForQuerier is the only one that takes it slightly later:

q.userQueuesMx.RLock()

@justinjung04 (Contributor, Author)

Confirmed that users is used for iteration when searching for the next queue to handle. So:

  1. When we call getOrAddQueue, we append the user name to that list
  2. When we call deleteQueue, we delete the user name from that list
  3. When we call getNextQueueForQuerier, we get the user name from the list by index, then go to the user queue map to get the queue

Basically the user list is a bridge to the user queue at the end, so it makes sense to control them with a single lock. Will make an update.
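The single-lock design described in the three steps above can be sketched as follows (a hypothetical simplification: the names mirror the PR's queues, users, and userQueues, but the types and queue payloads are stand-ins, not Cortex's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

type queues struct {
	mx         sync.RWMutex        // one lock guards both fields below
	users      []string            // iteration order for querier round-robin
	userQueues map[string][]string // per-user request queue (simplified)
}

func (q *queues) getOrAddQueue(userID string) []string {
	q.mx.Lock()
	defer q.mx.Unlock()
	if _, ok := q.userQueues[userID]; !ok {
		q.users = append(q.users, userID) // step 1: append user to the list
		q.userQueues[userID] = nil
	}
	return q.userQueues[userID]
}

func (q *queues) deleteQueue(userID string) {
	q.mx.Lock()
	defer q.mx.Unlock()
	delete(q.userQueues, userID)
	for i, u := range q.users { // step 2: remove user from the list
		if u == userID {
			q.users = append(q.users[:i], q.users[i+1:]...)
			break
		}
	}
}

func (q *queues) getNextQueueForQuerier(lastIndex int) ([]string, string, int) {
	q.mx.RLock()
	defer q.mx.RUnlock()
	if len(q.users) == 0 {
		return nil, "", lastIndex
	}
	i := (lastIndex + 1) % len(q.users) // step 3: index into the list,
	u := q.users[i]                     // then look up the queue in the map
	return q.userQueues[u], u, i
}

func main() {
	q := &queues{userQueues: map[string][]string{}}
	q.getOrAddQueue("user-1")
	q.getOrAddQueue("user-2")
	_, u, i := q.getNextQueueForQuerier(-1)
	fmt.Println(u, i) // user-1 0
	q.deleteQueue("user-1")
	_, u, _ = q.getNextQueueForQuerier(-1)
	fmt.Println(u) // user-2
}
```

Because the list and the map are only ever read or written under the same lock, the delete/iterate interleaving from the race report cannot occur.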

Comment on lines -239 to -240
q.userQueuesMx.RLock()
defer q.userQueuesMx.RUnlock()
@justinjung04 (Contributor, Author)

Actually this shouldn't have been inside the for loop: defer doesn't run until the function returns, not at the end of each loop iteration.

A contributor replied:
I think the defer is fine, as it is an RLock and therefore re-entrant. By moving this lock out of the for loop, is the main point to protect q.users at L227? Then it looks good to me.

@@ -222,6 +221,9 @@ func (q *queues) createUserRequestQueue(userID string) userRequestQueue {
func (q *queues) getNextQueueForQuerier(lastUserIndex int, querierID string) (userRequestQueue, string, int) {
uid := lastUserIndex

q.queuesMx.RLock()
defer q.queuesMx.RUnlock()
@justinjung04 (Contributor, Author)

I also checked whether I could avoid defer and unlock manually (since I'm now locking two objects at the same time). But slices and maps are reference types in Go, so it was better to keep the lock until the function returns: we continue to read properties of those objects until then.
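The trade-off can be illustrated with a hypothetical sketch (simplified names, not the PR's code): because a slice header refers to shared backing storage, every read of it must stay inside the critical section, and a defer that holds the lock until return guarantees that.

```go
package main

import (
	"fmt"
	"sync"
)

type registry struct {
	mx    sync.RWMutex
	users []string
}

// lastUser holds the read lock via defer for the whole function body, so
// both reads of r.users (the length check and the index) see one consistent
// snapshot. Unlocking manually right after the length check would let a
// concurrent writer shrink the slice before the index below runs,
// reintroducing exactly the kind of race the PR fixes.
func (r *registry) lastUser() string {
	r.mx.RLock()
	defer r.mx.RUnlock()
	if len(r.users) == 0 {
		return ""
	}
	return r.users[len(r.users)-1]
}

func main() {
	r := &registry{users: []string{"a", "b"}}
	fmt.Println(r.lastUser()) // b
}
```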

@justinjung04 justinjung04 marked this pull request as ready for review August 13, 2024 19:08
@danielblando danielblando left a comment

Thanks

@danielblando danielblando requested a review from yeya24 August 13, 2024 20:19
CHANGELOG.md Outdated
@@ -53,6 +53,7 @@
* [BUGFIX] Ingester: Include out-of-order head compaction when compacting TSDB head. #6108
* [BUGFIX] Ingester: Fix `cortex_ingester_tsdb_mmap_chunks_total` metric. #6134
* [BUGFIX] Query Frontend: Fix query rejection bug for metadata queries. #6143
* [BUGFIX] Scheduler: Fix data race in user list of a queue. #6160
A contributor commented:

We don't need a dedicated entry for this fix. We can add the PR number to L52.


@yeya24 yeya24 left a comment
Thanks for the fix!

@yeya24 yeya24 merged commit ee8f8e9 into cortexproject:master Aug 13, 2024
15 checks passed
@justinjung04 justinjung04 deleted the bugfix branch March 26, 2025 17:51
Linked issue closed by this PR: Flaky test: TestQueueConcurrency triggered race condition