Samples can get out of order between distributor and ingester

For some instances we see "sample timestamp out of order for series" in our logs, with a gap of 15 or 30 seconds between the previous and the new timestamp.
If this were going wrong in the sending Prometheus, we would see the same error on all ingester replicas. We do not: the errors are reported sporadically, on one ingester at a time. From this I deduce that the reordering is happening inside Cortex.

Here is my best theory. Suppose some client Prometheus has hundreds of samples queued up for remote write. Then the following can happen:
- Prometheus sends 100 samples to the distributor.
- The distributor replicates the data three times and fires up three goroutines to deliver it.
- Once two of the calls have returned from the ingesters, the distributor returns success to Prometheus.
- The third call continues on its own goroutine.
- Prometheus sends the next 100 samples; the distributor (likely on another node) fires up another three goroutines.
- One of those goroutines can overtake the third one from the previous call, so that ingester receives the newer samples first and rejects the older ones as out of order.
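
The sequence above can be reproduced in a small, self-contained Go simulation. This is only a sketch of the theory, not Cortex code: the `ingester` type, `replicate`, `simulate`, and the gate channels are hypothetical names made up for illustration, the quorum loop counts any reply rather than only successes, and the stall on the third replica is forced deterministically instead of depending on real network timing.

```go
package main

import (
	"fmt"
	"sync"
)

// ingester is a hypothetical stand-in for a real ingester: it remembers the
// last timestamp it accepted and rejects anything that is not newer.
type ingester struct {
	mu       sync.Mutex
	name     string
	lastTS   int64
	rejected []int64
}

func (i *ingester) push(ts int64) error {
	i.mu.Lock()
	defer i.mu.Unlock()
	if ts <= i.lastTS {
		i.rejected = append(i.rejected, ts)
		return fmt.Errorf("%s: sample timestamp out of order: previous=%d new=%d", i.name, i.lastTS, ts)
	}
	i.lastTS = ts
	return nil
}

// replicate sends ts to every ingester on its own goroutine and returns as
// soon as `quorum` of them have replied. A non-nil gate delays that replica's
// call until the gate is closed, which simulates a slow request.
func replicate(ts int64, ings []*ingester, gates []chan struct{}, quorum int, wg *sync.WaitGroup) {
	acks := make(chan error, len(ings)) // buffered so stragglers never block
	for idx, ing := range ings {
		wg.Add(1)
		go func(ing *ingester, gate chan struct{}) {
			defer wg.Done()
			if gate != nil {
				<-gate
			}
			acks <- ing.push(ts)
		}(ing, gates[idx])
	}
	for n := 0; n < quorum; n++ {
		<-acks // simplification: any reply counts towards the quorum
	}
}

// simulate runs two batches and returns the rejected timestamps per ingester.
func simulate() [][]int64 {
	ings := []*ingester{{name: "ingester-1"}, {name: "ingester-2"}, {name: "ingester-3"}}
	var batch1, batch2 sync.WaitGroup

	// Batch 1 (t=15): the call to the third replica is stalled.
	slow := make(chan struct{})
	replicate(15, ings, []chan struct{}{nil, nil, slow}, 2, &batch1)

	// Batch 2 (t=30): all three calls go through without delay.
	replicate(30, ings, make([]chan struct{}, 3), 2, &batch2)
	batch2.Wait()

	// Only now does the stalled call from batch 1 reach ingester-3.
	close(slow)
	batch1.Wait()

	out := make([][]int64, len(ings))
	for i, ing := range ings {
		out[i] = ing.rejected
	}
	return out
}

func main() {
	for i, r := range simulate() {
		fmt.Printf("ingester-%d rejected timestamps: %v\n", i+1, r)
	}
}
```

In this sketch only the third ingester rejects a sample (the one at t=15), which matches the observation that the error shows up on one replica at a time rather than on all of them.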



Samples can get out of order between distributor and ingester #670
