Skip to content

Stop early-return after majority success in distributor writes #730

Closed
@bboreham

Description

@bboreham

Suppose we have a replication factor of 3, then the writing process is:

  • Distributor replicates the data three times and fires up three goroutines to deliver the data.
  • Once two of the calls have returned from ingesters, distributor returns success.
  • Context is cancelled (in distributor, because the call ended; in ruler, to avoid leaking timers)
  • Third call may or may not have updated its ingester.

I already think this causes #670; the idea that each replica may be missing some samples makes me more queasy.

In a typical Push() call containing 100 samples, the distributor will be writing to all ingesters (in different combinations for each metric), which will make the missing samples issue rare. And until #673 there was no timeout on ruler updates, so that context was never cancelled.

Let's just stop it - wait for all calls to return, or a timeout.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions