Skip to content

Use most recent time to remove from ring #4501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

danielblando
Copy link
Contributor

@danielblando danielblando commented Sep 29, 2021

Signed-off-by: Daniel Blando [email protected]

What this PR does:
We previously did a change for forget instance from the ring on 4603. In some rare cases the timestamp in the ring can be more recent than our current time (eg. clock skew into the future).
This change make sure we can remove the instance even if these cases occcurs

Which issue(s) this PR fixes:
Fixes #4461

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@danielblando danielblando marked this pull request as draft September 29, 2021 21:16
@danielblando danielblando marked this pull request as ready for review September 30, 2021 00:25
// If timestamp on the ring is in "the future" we need to use the timestamp of the ring
// otherwise the gossip message will be ignored by the peers.
// see https://github.com/cortexproject/cortex/blob/d3f46c53616dcdf5c670f3e3a4d371e44b10683e/pkg/ring/model.go#L192-L200
if ting.Timestamp < now.Unix() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this fixes your issue. We must update the timestamp, otherwise the change is not propagated via gossip, so I think that what you want is something more like this:

if ting.Timestamp < now.Unix() {
    ting.Timestamp = now.Unix()
} else {
    ting.Timestamp += 1
}

Copy link
Contributor Author

@danielblando danielblando Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pracucci thanks for looking into this.

We do add the LEFT change right after this line in the updates list.

updated = append(updated, name)

which will go onto the changes returned

cortex/pkg/ring/model.go

Lines 236 to 241 in 5e7f104

// Let's build a "change" for returning
out := NewDesc()
for _, u := range updated {
ing := thisIngesterMap[u]
out.Ingesters[u] = ing
}

Form what i can tell we don't depend on timestamp

I think what we need to make sure is that the misconfigured ingester be shut down before forgetting it, which will make sure it does not rejoin the ring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's best to put a newer timestamp on newer data.
That way everyone in the gossip will treat it as new, not be uncertain whether they should use the other info.

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This change makes sense to me: When we "forget" entry that has time in the future, we don't need to update the time (as it's already , but set state to LEFT. The change will be propagated via gossip, and other nodes will accept it unless they have even more recent entry.

Comment on lines +63 to 64
* [BUGFIX] Memberlist: Remove entry from the ring even if the timestamp is in the future #4501

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entry should be moved to master / unreleased section now.

@pstibrany
Copy link
Contributor

One problem I can see with this change is that if ingester (or other component) is restarted, and it uses same instance ID but no longer has clock-skew, it will be unable to join the ring, if there is still LEFT entry with future time :( This seems bit dangerous.

@pstibrany
Copy link
Contributor

One problem I can see with this change is that if ingester (or other component) is restarted, and it uses same instance ID but no longer has clock-skew, it will be unable to join the ring, if there is still LEFT entry with future time :( This seems bit dangerous.

Although this assumes skew more than just a few seconds... hopefully that's not the case!

@danielblando
Copy link
Contributor Author

Hey @pstibrany, thanks for the time.

Good catch, in our case the clock skew was 3 days which would made very long to rejoin even with the change. This would make this change not much helpful. We could try to add a flag for this scenarios, but seems just creating more code for issues.

I think in this case the best approach is to refuse the bad info from the beginning and not let that in the ring. I was trying to do that in this draft, but the problem is how to add a configuration in Merge implementation. To add in the context we would need to insert a new info on the Desc which would make this spread in the protocol.

message Desc {
map<string,InstanceDesc> ingesters = 1 [(gogoproto.nullable) = false];
reserved 2;
}

Do you have any opinion on this approach?

@pstibrany
Copy link
Contributor

I think in this case the best approach is to refuse the bad info from the beginning and not let that in the ring. I was trying to do that in this draft, but the problem is how to add a configuration in Merge implementation. To add in the context we would need to insert a new info on the Desc which would make this spread in the protocol.

One problem is how to pass that information to Desc, that would be solveable. What is more difficult is how to make that work properly with requirements on Merge method. I don't see any easy answers there. By breaking Merge properties we would introduce more problems into how gossip works :( I think the best approach would be to make sure that clocks are reasonably synced in the first place, if possible.

@danielblando
Copy link
Contributor Author

I think the best approach would be to make sure that clocks are reasonably synced in the first place, if possible.

That is a little trick, no? I agree that these are rare scenarios, but we can't assume it will not happen. In my case, I had it running in EC2 with AWS NTP and still happened.

Screen Shot 2021-08-12 at 5 40 42 PM

I think the biggest problem is that we don't have a way out of these situation right now. The best solution to get the ingester back to the ring would be restarting the ring.

One problem is how to pass that information to Desc, that would be solveable. What is more difficult is how to make that work properly with requirements on Merge method.

Agree that it can be solvable, it is just a matter on how we decide to introduce that information. I understand your point of breaking the "Commutativity" requirement, but denying the information would make an error pop up and leave the wrong configured ingester without update. The information would not be gossip forward. From the error we could act on it to fix the issue.

@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 2, 2022
@stale stale bot closed this Mar 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clock Skew can cause Ingesters to be unable to join the ring or be forgotten.
5 participants