Fix data race in user list of a queue #6160
Signed-off-by: Justin Jung <[email protected]>
I was looking at the code, and we already have a mutex. From the failing tests, it seems that the delete-queue function already uses `userQueuesMx` to guard the users as well (cortex/pkg/scheduler/queue/user_queues.go, line 106 in f088997).
`GetOrAddQueue` does the same (cortex/pkg/scheduler/queue/user_queues.go, line 141 in f088997).
`GetNextQueueForQuerier` is the only one that takes the lock slightly later (cortex/pkg/scheduler/queue/user_queues.go, line 247 in f088997).
Confirmed that
Basically, the user list is a bridge to reach the user queue in the end, so it makes sense to control both with a single lock. Will make an update.
q.userQueuesMx.RLock()
defer q.userQueuesMx.RUnlock()
Actually, this shouldn't have been inside the for loop: a deferred call doesn't run until the function returns, not at the end of each loop iteration.
I think the defer is fine, since it is an RLock and so re-entrant. By moving this lock out of the for loop, is the main point to protect `q.users` at L227? If so, it looks good to me.
@@ -222,6 +221,9 @@ func (q *queues) createUserRequestQueue(userID string) userRequestQueue {
func (q *queues) getNextQueueForQuerier(lastUserIndex int, querierID string) (userRequestQueue, string, int) {
uid := lastUserIndex

q.queuesMx.RLock()
defer q.queuesMx.RUnlock()
I also checked whether I could drop the defer and unlock manually (since I'm now holding two locks at the same time). But slices and maps in Go refer to shared underlying data, so it was better to hold the lock until the function returns (we keep reading properties of those objects until then).
Thanks
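The reasoning above about holding the lock until return can be shown with a short sketch. It assumes a hypothetical `store` type with `aliasUsers`, `copyUsers`, and `rename` methods, none of which come from the Cortex code.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical struct with a lock-guarded slice.
type store struct {
	mx    sync.RWMutex
	users []string
}

// aliasUsers returns the slice header only. The caller still shares the
// backing array after the lock is released, so later reads are unguarded.
func (s *store) aliasUsers() []string {
	s.mx.RLock()
	defer s.mx.RUnlock()
	return s.users
}

// copyUsers returns an independent copy that is safe to read after unlock.
func (s *store) copyUsers() []string {
	s.mx.RLock()
	defer s.mx.RUnlock()
	return append([]string(nil), s.users...)
}

// rename overwrites one element in place under the write lock.
func (s *store) rename(i int, name string) {
	s.mx.Lock()
	defer s.mx.Unlock()
	s.users[i] = name
}

func main() {
	s := &store{users: []string{"a", "b"}}
	alias, cp := s.aliasUsers(), s.copyUsers()
	s.rename(0, "z")
	fmt.Println(alias[0], cp[0]) // prints z a
}
```

A function that keeps reading such a slice or map must either hold the lock for as long as it reads, as the PR does, or work on a copy taken while the lock was held.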
CHANGELOG.md
Outdated
@@ -53,6 +53,7 @@
* [BUGFIX] Ingester: Include out-of-order head compaction when compacting TSDB head. #6108
* [BUGFIX] Ingester: Fix `cortex_ingester_tsdb_mmap_chunks_total` metric. #6134
* [BUGFIX] Query Frontend: Fix query rejection bug for metadata queries. #6143
* [BUGFIX] Scheduler: Fix data race in user list of a queue. #6160
We don't need a dedicated entry for this fix. We can add the PR number to L52.
Thanks for the fix!
What this PR does:
Adds a mutex on the user list of a queue. The race rarely occurs, but it was the root cause of the flaky test failing with the message below:
It seems we have always had this issue, but it only surfaced after a race condition test was added in the previous PR.
Unfortunately, I wasn't able to create a test case that reproduces it consistently.
Which issue(s) this PR fixes:
Fixes #6109
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]