Skip to content

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Mar 3, 2021

What this PR does:
As described in the #3571, when shuffle-sharding is enabled in the query-frontend/scheduler and a querier crashes, tenants are immediately resharded across the remaining queriers. This practically invalidates the assumption that shuffle-sharding can be used to contain the blast radius in case of a "poisoned query" on the read path: if a tenant repeatedly send a poisoned query over and over it has the ability to crash all queriers, and not just its shard.

In this PR I propose a solution to mitigate it, introducing a delay ("forget delay") between when a querier disconnects because of a crash and when a tenant's shard changes because of that.

To do it, a query-frontend/scheduler needs to know when a querier disconnects because of crash. I've introduced a "graceful shutdown notification" from the querier to query-frontend/scheduler: when a querier disconnects from the query-frontend/scheduler without sending such notification it means the querier crashed / abruptly terminated.

I've done some manual testing and looks working as expected.

Which issue(s) this PR fixes:
Fixes #3571

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci requested a review from pstibrany March 3, 2021 13:49
@pracucci pracucci changed the title Fix querier shuffle sharding Fix querier shuffle-sharding blast radius containment Mar 3, 2021
@pracucci pracucci changed the title Fix querier shuffle-sharding blast radius containment Fix queriers shuffle-sharding blast radius containment Mar 3, 2021
@pracucci pracucci marked this pull request as draft March 3, 2021 13:50
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, looking forward to see the tests. My comments are mostly some grammar suggestions... if you find these annoying, let me know and I will stop. (Some may also be wrong, so... :))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd suggest to move forgetTimeout to user_queues.go, and pass only threshold to forgetDisconnectQueriers. removeQuerierConnection is also using this field, but only to know whether to keep disconnected querier or not. We could pass that information as a bool to removeQuerierConnection. This would simplify queues a bit, and allow for easier testing of forgetDisconnectQueriers. I would also suggest passing time.Now() as parameter to removeQuerierConnection, to avoid having dependency on current time in queues. WDYT?

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 2ccc0d8 to 03cb76f Compare March 4, 2021 10:48
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Mar 4, 2021
Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me functionally. I had a question about alternative naming of the configuration value being added, but otherwise LGTM.

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 2732880 to 4ec324d Compare March 5, 2021 09:21
@pracucci pracucci marked this pull request as ready for review March 5, 2021 09:49
@pracucci pracucci requested a review from pstibrany March 5, 2021 09:50
Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this and the changes.

CHANGELOG.md Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cortex release 1.8.0 is now in progress. Could you please rebase master and move the CHANGELOG entry under the master / unreleased section?

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 68d9eca to 8b915a5 Compare March 8, 2021 08:31
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I would prefer to move forgetDelay out of queues struct, and pass forgetTime (when to forget given querier, or zero time if immediately) to removeQuerierConnection and only threshold to forgetDisconnectedQueriers. It's a small change, but helps to minimize change to the queues struct, which is already tricky.

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 04220de to 1a2582d Compare March 8, 2021 16:11
pracucci and others added 12 commits March 9, 2021 08:30
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
pracucci and others added 14 commits March 9, 2021 08:30
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 1a2582d to 1ca9855 Compare March 9, 2021 07:30
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci merged commit 1ddb423 into cortexproject:master Mar 9, 2021
@pracucci pracucci deleted the fix-querier-shuffle-sharding branch March 9, 2021 08:43
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
…#3901)

* Added forget timeout support to queues

Signed-off-by: Marco Pracucci <[email protected]>

* Added notify shutdown rpc to query-frontend and query-scheduler proto

Signed-off-by: Marco Pracucci <[email protected]>

* Querier worker notifies shutdown to query-frontend/scheduler

Signed-off-by: Marco Pracucci <[email protected]>

* Log when query-frontend/scheduler receives a shutdown notification

Signed-off-by: Marco Pracucci <[email protected]>

* Added config option to configure the forget timeout

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed re-connect while in forget waiting period

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed unit tests

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed GetNextRequestForQuerier() when a resharding happen after fogetting a querier

Signed-off-by: Marco Pracucci <[email protected]>

* Update pkg/frontend/v1/frontend.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/scheduler/queue/user_queues.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/scheduler/queue/user_queues.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/scheduler/scheduler.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/querier/worker/frontend_processor.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Updated comment based on review feedback

Signed-off-by: Marco Pracucci <[email protected]>

* Updated comment based on review feedback

Signed-off-by: Marco Pracucci <[email protected]>

* Updated generated doc

Signed-off-by: Marco Pracucci <[email protected]>

* Added name to services

Signed-off-by: Marco Pracucci <[email protected]>

* Moved forgetCheckPeriod where it's used

Signed-off-by: Marco Pracucci <[email protected]>

* Added queues forget timeout unit tests

Signed-off-by: Marco Pracucci <[email protected]>

* Added RequestQueue unit test

Signed-off-by: Marco Pracucci <[email protected]>

* Renamed querier forget timeout into delay

Signed-off-by: Marco Pracucci <[email protected]>

* Added timeout to the notify shutdown notification

Signed-off-by: Marco Pracucci <[email protected]>

* Updated doc

Signed-off-by: Marco Pracucci <[email protected]>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

* Update pkg/scheduler/scheduler.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/frontend/v1/frontend.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Updated doc

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shuffle-sharding of queriers may not work as intended
3 participants