Fix TestAlertmanager_StateReplicationWithSharding. #4078
The test had two issues:

1. The most common issue was that replicating the silence sometimes took a small amount of time, as it happens asynchronously. Replacing Equal with Eventually when checking the replication metrics is enough to make the tests reliable.
2. The second issue was rarer: when a particular shard does not have any users assigned to it, _and_ the test randomly picks _that_ shard to write the silence to, the test would panic (in rand.Intn). Because alertmanagersMtx was held at that point, the test did not fail outright; it hung while trying to shut down the alertmanagers.

Signed-off-by: Steve Simpson <[email protected]>
I agree on case 1. For case 2, I'd appreciate a second set of eyes from other maintainers: I'm not entirely sure I understand why this doesn't work as-is, and we might be missing something in the core logic.
LGTM!
Thanks for fixing it!
lgtm thanks!