Fix data race in user list of a queue #6160


Merged: 3 commits, Aug 13, 2024

Conversation

@justinjung04 (Contributor) commented Aug 13, 2024

What this PR does:

Adds a mutex on the user list of a queue. The race rarely happens, but it was the root cause of the flaky test failing with the message below:

==================
WARNING: DATA RACE
Write at 0x00c000590000 by goroutine 52:
  github.com/cortexproject/cortex/pkg/scheduler/queue.(*queues).deleteQueue()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues.go:114 +0x1c4
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.func1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:486 +0x12a
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.gowrap1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:488 +0x41

Previous read at 0x00c000590000 by goroutine 51:
  github.com/cortexproject/cortex/pkg/scheduler/queue.(*queues).getNextQueueForQuerier()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues.go:234 +0x129
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.func1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:482 +0x1a4
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency.gowrap1()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:488 +0x41

Goroutine 52 (running) created at:
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:477 +0x324
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44

Goroutine 51 (running) created at:
  github.com/cortexproject/cortex/pkg/scheduler/queue.TestQueueConcurrency()
      /__w/cortex/cortex/pkg/scheduler/queue/user_queues_test.go:477 +0x324
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1742 +0x44
==================

It seems we have always had this issue, but it only surfaced after a race-condition test was added in the previous PR.
Unfortunately I wasn't able to create a test case that reproduces this consistently.
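For context, a concurrency test of the shape the race detector flags can be sketched like this (a hypothetical reproduction with simplified stand-in types, not the PR's actual test; with the mutex lines removed, running under `go test -race` or `go run -race` would report a data race on the user list):

```go
package main

import "sync"

// userList is a simplified stand-in for the queue's user list.
type userList struct {
	mx    sync.Mutex
	users []string
}

func (l *userList) add(u string) {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.users = append(l.users, u)
}

func (l *userList) deleteAll() {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.users = l.users[:0]
}

func (l *userList) next(i int) string {
	l.mx.Lock()
	defer l.mx.Unlock()
	if len(l.users) == 0 {
		return ""
	}
	return l.users[i%len(l.users)]
}

func main() {
	l := &userList{}
	var wg sync.WaitGroup
	// Hammer the list from several goroutines, mixing writes and reads,
	// so the race detector has a chance to observe conflicting accesses.
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				l.add("user")
				l.next(j)
				l.deleteAll()
			}
		}()
	}
	wg.Wait() // with the mutex removed, -race reports a data race here
}
```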

Which issue(s) this PR fixes:
Fixes #6109

Checklist

  • [n/a] Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Justin Jung <[email protected]>
@danielblando (Contributor)

I was looking at the code: we already have a mutex, userQueuesMx, and it looks like it is always locked right before the new mutex usersMx. Does it make sense to create a new mutex, or will userQueues and users need to be locked together? It seems we always track both together, so it may be simpler to use a single mutex for both.

From the failing tests it seems that getNextQueueForQuerier is usually involved in the issue, and it is the only function that locks userQueuesMx after checking users. Isn't it simpler to take the lock earlier and have only one lock?

deleteQueue already uses userQueuesMx to also guard users:

q.userQueuesMx.Lock()

getOrAddQueue does the same:

q.userQueuesMx.Lock()

getNextQueueForQuerier is the only one that takes it slightly later:

q.userQueuesMx.RLock()

@justinjung04 (Contributor, Author)

Confirmed that users is used for iteration when searching for the next queue to handle. So:

  1. When we call getOrAddQueue, we append the user name to that list
  2. When we call deleteQueue, we delete the user name from that list
  3. When we call getNextQueueForQuerier, we get the user name from the list by index, then go to the user queue map to get the queue

Basically the user list is a bridge to the user queue at the end, so it makes sense to control them with a single lock. Will make an update.
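The single-lock design described in the three steps above can be sketched as follows (a hypothetical simplification: the names mirror the PR's queues, users, and userQueues, but the types and queue payloads are stand-ins, not Cortex's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

type queues struct {
	mx         sync.RWMutex        // one lock guards both fields below
	users      []string            // iteration order for querier round-robin
	userQueues map[string][]string // per-user request queue (simplified)
}

func (q *queues) getOrAddQueue(userID string) []string {
	q.mx.Lock()
	defer q.mx.Unlock()
	if _, ok := q.userQueues[userID]; !ok {
		q.users = append(q.users, userID) // step 1: append user to the list
		q.userQueues[userID] = nil
	}
	return q.userQueues[userID]
}

func (q *queues) deleteQueue(userID string) {
	q.mx.Lock()
	defer q.mx.Unlock()
	delete(q.userQueues, userID)
	for i, u := range q.users { // step 2: remove user from the list
		if u == userID {
			q.users = append(q.users[:i], q.users[i+1:]...)
			break
		}
	}
}

func (q *queues) getNextQueueForQuerier(lastIndex int) ([]string, string, int) {
	q.mx.RLock()
	defer q.mx.RUnlock()
	if len(q.users) == 0 {
		return nil, "", lastIndex
	}
	i := (lastIndex + 1) % len(q.users) // step 3: index into the list,
	u := q.users[i]                     // then look up the queue in the map
	return q.userQueues[u], u, i
}

func main() {
	q := &queues{userQueues: map[string][]string{}}
	q.getOrAddQueue("user-1")
	q.getOrAddQueue("user-2")
	_, u, i := q.getNextQueueForQuerier(-1)
	fmt.Println(u, i) // user-1 0
	q.deleteQueue("user-1")
	_, u, _ = q.getNextQueueForQuerier(-1)
	fmt.Println(u) // user-2
}
```

Because the list and the map are only ever read or written under the same lock, the delete/iterate interleaving from the race report cannot occur.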

Comment on lines -239 to -240
q.userQueuesMx.RLock()
defer q.userQueuesMx.RUnlock()
@justinjung04 (Contributor, Author)

Actually this shouldn't have been inside the for loop: defer doesn't run until the function returns, not at the end of each loop iteration.

A contributor replied:
I think the defer is fine, as it is an RLock and therefore re-entrant. By moving this lock out of the for loop, is the main point to protect q.users at L227? Then it looks good to me.

@@ -222,6 +221,9 @@ func (q *queues) createUserRequestQueue(userID string) userRequestQueue {
func (q *queues) getNextQueueForQuerier(lastUserIndex int, querierID string) (userRequestQueue, string, int) {
uid := lastUserIndex

q.queuesMx.RLock()
defer q.queuesMx.RUnlock()
@justinjung04 (Contributor, Author)

I also checked whether I could avoid defer and unlock manually (since I'm now locking two objects at the same time). But slices and maps are reference types in Go, so it was better to keep the lock until the function returns: we continue to read properties of those objects until then.
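The trade-off can be illustrated with a hypothetical sketch (simplified names, not the PR's code): because a slice header refers to shared backing storage, every read of it must stay inside the critical section, and a defer that holds the lock until return guarantees that.

```go
package main

import (
	"fmt"
	"sync"
)

type registry struct {
	mx    sync.RWMutex
	users []string
}

// lastUser holds the read lock via defer for the whole function body, so
// both reads of r.users (the length check and the index) see one consistent
// snapshot. Unlocking manually right after the length check would let a
// concurrent writer shrink the slice before the index below runs,
// reintroducing exactly the kind of race the PR fixes.
func (r *registry) lastUser() string {
	r.mx.RLock()
	defer r.mx.RUnlock()
	if len(r.users) == 0 {
		return ""
	}
	return r.users[len(r.users)-1]
}

func main() {
	r := &registry{users: []string{"a", "b"}}
	fmt.Println(r.lastUser()) // b
}
```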

@justinjung04 justinjung04 marked this pull request as ready for review August 13, 2024 19:08
@danielblando danielblando left a comment

Thanks

@danielblando danielblando requested a review from yeya24 August 13, 2024 20:19
CHANGELOG.md Outdated
@@ -53,6 +53,7 @@
* [BUGFIX] Ingester: Include out-of-order head compaction when compacting TSDB head. #6108
* [BUGFIX] Ingester: Fix `cortex_ingester_tsdb_mmap_chunks_total` metric. #6134
* [BUGFIX] Query Frontend: Fix query rejection bug for metadata queries. #6143
* [BUGFIX] Scheduler: Fix data race in user list of a queue. #6160
A contributor commented:

We don't need a dedicated entry for this fix. We can add the PR number to L52.


@yeya24 yeya24 left a comment
Thanks for the fix!

@yeya24 yeya24 merged commit ee8f8e9 into cortexproject:master Aug 13, 2024
15 checks passed
@justinjung04 justinjung04 deleted the bugfix branch March 26, 2025 17:51
Linked issue closed by this PR: Flaky test: TestQueueConcurrency triggered race condition