
Blocks storage unable to ingest samples older than 1h after an outage #2366


Open
pracucci opened this issue Mar 31, 2020 · 55 comments
Labels
keepalive (Skipped by stale bot), storage/blocks (Blocks storage engine)

Comments

@pracucci
Contributor

TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Given that the head's minimum valid time is also kept at 50% of the block range behind the head's max timestamp, and given the default block range period is 2h, the blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Let's consider this scenario:

  • Multiple Prometheus servers remote-write to the same Cortex tenant
  • Some Prometheus servers stop remote writing to Cortex (for any reason, e.g. a networking issue) and fall behind by more than 1h
  • When those Prometheus servers come back online, Cortex discards any sample whose timestamp is older than 1h, because the max timestamp in the TSDB head is close to "now" (due to the healthy Prometheus servers which never stopped writing series) while the failing ones are trying to catch up by writing samples older than 1h

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.

@bwplotka You may be interested, since I think this issue affects Thanos receive too.

@pracucci
Contributor Author

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?

@pstibrany
Contributor

pstibrany commented Mar 31, 2020

Noting other possible options (not necessarily good options):

  • extend block range (would lead to higher ingester memory usage, and longer WAL replays)
  • change the cutting formula – I was trying to refer to the 50% limit, but it's not related to cutting.

@bwplotka
Contributor

blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does this?

@bwplotka
Contributor

Also, clock skew can cause it.

@pstibrany
Contributor

pstibrany commented Mar 31, 2020

blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does this?

A new block is cut from the head when the head (in-memory data) covers more than 1.5x the block range. For a 2-hour block range, this means the head needs to hold 3h of data before a block is cut. The block's "start time" is always the minimum sample time in the head, while its "end time" is aligned to a block range boundary, and a new block always covers a single "block range" period. Data stored into the block is then removed from the head. That means that after cutting the block, the head still holds at least 1h of data, but possibly more.

When writing new data via the appender, the minimum valid time is computed as Max(minT, maxT - 0.5 * block range), where minT and maxT are the minimum/maximum sample times in the head. This limit is then enforced in the Add and AddFast methods.

When the head covers less than 0.5x the block range (<1h for a 2h block range), samples cannot be older than the min time in the head. When the head covers more than 0.5x the block range (>1h for a 2h block range), samples cannot be older than half the block range behind the max time in the head.
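
To make the limit concrete, here is a minimal sketch in Go (not the actual Prometheus code; timestamps are in milliseconds and the helper name is illustrative):

```go
package main

import "fmt"

const blockRangeMs = int64(2 * 60 * 60 * 1000) // default 2h block range

// minValidTime mirrors the rule described above:
// Max(head minT, head maxT - 0.5 * block range).
func minValidTime(headMinT, headMaxT int64) int64 {
	lower := headMaxT - blockRangeMs/2
	if headMinT > lower {
		return headMinT
	}
	return lower
}

func main() {
	now := int64(10 * 60 * 60 * 1000) // arbitrary "now" at t=10h
	headMinT := now - blockRangeMs    // head currently holds 2h of data
	limit := minValidTime(headMinT, now)

	lateSample := now - 90*60*1000   // a Prometheus server catching up, 90m behind
	fmt.Println(lateSample >= limit) // false: the sample is rejected as "out of bounds"
}
```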

@bwplotka
Contributor

Thanks for the explanation. I think we are getting into the backfilling area.

I think opening a side block and using vertical compaction would solve it.

@bwplotka
Contributor

We could start a Prometheus discussion for it as well, to enable that behavior when vertical compaction is enabled for TSDB itself.

@pracucci
Contributor Author

pracucci commented Mar 31, 2020

I think we are getting into the backfilling area.

It depends. How far back is considered backfilling? From the Cortex (or Thanos receive) perspective, I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

@brancz
Contributor

brancz commented Mar 31, 2020

Maybe it means that remote write is not good enough. In this scenario you only have up to 2 hours for that WAL to even still be around, at which point it would be cut into a block. Maybe we need to discuss back-filling with blocks as a remote write extension. Just thinking out loud. (I think I briefly discussed this with @csmarchbanks once before)

@bwplotka
Contributor

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

@pracucci
Contributor Author

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

From the pure UX side, if it's about a Prometheus server catching up after an outage then I wouldn't consider it backfilling.

@bwplotka
Contributor

I guess the boundary is then 1.5x block size (once we exceed WAL (3h))?

@bwplotka
Contributor

Actually the user can change that, so we could even have a 2-year WAL. If someone doesn't upload anything for 2 years and then suddenly wants to push a 2-year-old sample, would that still not be backfilling 🤔?

@brancz
Contributor

brancz commented Mar 31, 2020

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

@bwplotka
Contributor

bwplotka commented Mar 31, 2020

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

What do you mean? When would you cut then? You don't know upfront what writes you expect to see but don't see, no?

@brancz
Contributor

brancz commented Mar 31, 2020

The heuristic of allowing inserts up to 0.5x the timespan of the head block is based on the assumption that we can safely and correctly cut blocks at that time; I'm wondering what other strategies there might be. Clearly other databases do different things, and time-based rules are actually kind of weird in the first place. What I'm trying to say is: if we remove that requirement, then we might be able to think of ways to improve this situation (potentially combined with vertical compaction?).

@csmarchbanks
Contributor

Maybe we need to discuss back-filling with blocks as a remote write extension

I think I have some code sitting around somewhere that does this (I was using it to populate datasets from Prometheus into various backends that supported remote write). If there is interest I'd be happy to dig it up again.

we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic

Yes, that would be great. There were some ideas around this when we were discussing how to limit Prometheus memory usage, weren't there? I remember at least something around a space-based head block.

@codesome
Contributor

codesome commented Apr 6, 2020

Catching up with emails now :) looks like I missed some discussions

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?

While "out of bound" in TSDB would work fine, it needs some more discussion if it has to be upstream. Also, talking w.r.t. cortex, you will have an unexpected rise in memory consumption because Head block gets bigger than expected. (Additionally, vertical queries and compactions are a tad bit more expensive in terms of CPU and Memory)

I think opening a side block and using vertical compaction would solve it.

Is this idea for upstream Prometheus or Thanos/Cortex? But anyway, do we have any requirement that data is available for querying soon after ingesting?

extend block range (would lead to higher ingester memory usage, and longer WAL replays)

With the m-mapping work that is going on, the memory usage can be taken care of. And if the partial-chunks work (a follow-up to the m-map work) looks good to maintainers, that would also take care of WAL replays :). This would let Cortex increase its block range, but the default in upstream Prometheus would need to be changed too so that the WAL is kept around longer.

@codesome
Contributor

codesome commented Apr 6, 2020

I would rather try to avoid adding samples older than the Head's mint and bringing vertical compaction into play in upstream Prometheus, because (1) the code is already complex enough (is that a valid argument? :P), (2) if not used correctly, users will silently lose/corrupt data, and (3) unexpected spikes in CPU/memory (maybe these should be expected?).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

@pracucci
Contributor Author

pracucci commented Apr 6, 2020

Also, with respect to Cortex, you will see an unexpected rise in memory consumption because the Head block gets bigger than expected.

Is this true even after the m-mapping work is complete?

I think opening a side block and using vertical compaction would solve it.
But anyway, do we have any requirement that data is available for querying soon after ingesting?

Yes, we should be able to immediately query back samples pushed to the ingesters, like it's working for the chunks storage (this issue affects only the blocks storage).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

I was thinking about an optional flag (like overlapping data).
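
For illustration only, here is a rough sketch of what such per-tenant settings could look like; the field names are hypothetical (only the overlapping-blocks option exists in TSDB, the out-of-bound one is the flag being discussed here), so treat this as the shape of the idea rather than an actual API:

```go
package main

import "time"

// tsdbConfig sketches per-tenant TSDB settings Cortex could expose.
// AllowOverlappingBlocks mirrors the existing TSDB option for overlapping
// data; AllowOutOfBoundSamples is hypothetical and does not exist upstream.
type tsdbConfig struct {
	BlockRange             time.Duration // e.g. 2h, the default block range
	AllowOverlappingBlocks bool          // let vertical compaction/querying merge overlaps
	AllowOutOfBoundSamples bool          // hypothetical: skip the maxT - blockRange/2 check
}

func defaultTenantConfig() tsdbConfig {
	return tsdbConfig{
		BlockRange:             2 * time.Hour,
		AllowOverlappingBlocks: true,
		AllowOutOfBoundSamples: true,
	}
}

func main() {
	_ = defaultTenantConfig()
}
```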

@stale

stale bot commented Jun 5, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 5, 2020
@pracucci pracucci added the keepalive Skipped by stale bot label Jun 5, 2020
@stale stale bot removed the stale label Jun 5, 2020
@pracucci
Contributor Author

pracucci commented Jun 5, 2020

This is still valid. @codesome has some ideas he wants to experiment with.

@bwplotka
Contributor

bwplotka commented Jun 5, 2020 via email

@codesome
Contributor

codesome commented Jun 5, 2020

Can you share those ideas Ganesh?

It was one of my TODOs for today :) I will share it with you all once I have written it down.

@codesome
Contributor

codesome commented Jun 5, 2020

This is an idea off the top of my head and needs more thought.

The root problem

These 2 checks, this and this

Solution

Enable vertical compaction and the vertical querier, and make that check optional - upstream will keep the check enabled, whereas Cortex and Thanos can turn it off.

Why does this work?

Because we are not discarding the out-of-bound sample; it just gets added to its series. Out-of-order samples within a series are still discarded.

If a series lagging behind in time causes overlap with data on disk, the vertical querier will take care of deduping.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Any gotchas? Yes

  • Head compaction is based on the mint and maxt of the Head. One could generate synthetic data adding just 2 samples per series ranging over 3h and cause lots of unnecessary Head compactions. This can be taken care of by manually calling Compact() with some logic around it instead of calling it very often.

  • It is possible to add samples spanning a very large time range - this can end up compacting all blocks into one. One possible solution: instead of removing the out-of-bound check entirely, make the minValidTime configurable so that you can decide how far back in time you want to accept samples (see the sketch below).
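
A minimal sketch of that configurable limit, assuming a wall-clock based look-back window; the parameter name is illustrative and not an existing TSDB option:

```go
package main

import (
	"fmt"
	"time"
)

// minValidTime sketches the suggestion above: instead of dropping the
// out-of-bound check entirely, accept samples no older than a configurable
// window behind the wall clock. outOfBoundWindow is an illustrative name.
func minValidTime(outOfBoundWindow time.Duration) int64 {
	return time.Now().Add(-outOfBoundWindow).UnixMilli()
}

func main() {
	limit := minValidTime(6 * time.Hour) // e.g. accept samples up to 6h old

	sampleT := time.Now().Add(-3 * time.Hour).UnixMilli()
	fmt.Println(sampleT >= limit) // true: a 3h-old sample is accepted
}
```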

@pracucci
Contributor Author

pracucci commented Jun 8, 2020

Thanks @codesome for sharing your idea!

As you stated, a downside of this approach is that you may end up with a single large block, which will impact compaction. This is a solvable problem: we could replace the TSDB compactor planner with a custom one (e.g. exclude the non-aligned blocks from compaction, or compact them together without impacting correctly-aligned blocks).

The longer the time range of such blocks, the more problematic they are at query time, so, as you stated, we may add a limit on the oldest timestamp we allow to ingest (now - threshold). In practice this would make the system unsuitable for backfilling (because the threshold would be in terms of hours, not days/months) but it may solve the problem described in this issue.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Ideally we don't want any vertical compaction to occur in the ingesters; vertical compaction will be done by the Cortex compactor later on. What do you think?

@codesome
Contributor

The above proposal of having multiple TSDBs open for backfilling was discarded because of the unnecessary complexity it would bring into the code and the additional memory required in the worst case (for example, if you want to backfill up to 6h-old samples, you might need up to 6-7x the memory).

With the scope of this being only to recover from a long outage on either the Prometheus side or the Cortex side, and not to actually backfill old data, here is a simplified design:

Taking the example of allowing samples up to 7h old (this will be configurable): 1h is taken care of by the main TSDB, and for the remaining 6h we maintain only 2 additional TSDBs.

Both new TSDBs will be able to ingest 6h of data (this is done by setting their block range to 12h). The time ranges allowed in those blocks will be aligned to 6h (or the configured time).

Here is some visualization (M is the current TSDB. X and Y are the newly introduced TSDBs):

1.
                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
                   |------------|
                   |      X     |
                   |------------|

2. Time is moving, both X and Y will cover that 6h partially.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
            |------------|------------|
            |      X     |      Y     |
            |------------|------------|

3. The first one goes beyond the configured max age, hence we compact and ship its block.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
      |------------|------------|
      |      X     |      Y     |
      |------------|------------|

4. The TSDBs are rotated at this point, and the same thing continues.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
            |------------|------------|
            |      Y     |      X     |
            |------------|------------|

We have two of them because when one is compacting, the other should be up. And as time moves, together they cover that 6h gap (a rough routing sketch follows after the lists below).

Small implementation details:

  1. The new TSDBs are created only if a sample is received for those time ranges and it was not deemed out-of-order by the main TSDB.
  2. The 7h max age is calculated w.r.t. the wall clock and not the max time of the main TSDB; this avoids a sudden advancement of that maxt, which could make both old TSDBs unavailable since they would need to be compacted before they can ingest new data.
  3. Shutdown, flush, the shutdown handler, etc., on these new TSDBs will behave similarly to the main TSDB.
  4. Transfers during handover won't be done for these new TSDBs (moreover, handover might be entirely removed for blocks: Proposal: Remove support for TSDB blocks transfer between ingesters #2966)

Resource usage:

  1. This can take up to 3x the memory if all the tenants send old data up to the max age (with memory-mapping of head chunks, it's mostly the number of series that matters for memory).
  2. It needs more disk space to support the new TSDBs. Exact/rough numbers are yet to be calculated for this.
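
To make the routing concrete, here is a hedged sketch with illustrative names (index 0 is the main TSDB M, 1 and 2 are the rotating side TSDBs X and Y); this is not actual Cortex code, just the shape of the idea:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	sideRange = 6 * time.Hour // range covered by each side TSDB (configurable)
	maxAge    = 7 * time.Hour // oldest accepted sample, relative to the wall clock
)

var errTooOld = errors.New("sample older than configured max age")

// routeSample picks a TSDB for a sample: index 0 is the main TSDB (M),
// 1 and 2 are the two rotating side TSDBs (X and Y). mainMinValid is the
// main TSDB's min valid time (its maxT - blockRange/2).
func routeSample(ts, mainMinValid int64, now time.Time) (int, error) {
	if ts >= mainMinValid {
		return 0, nil // normal path: main TSDB
	}
	if ts < now.Add(-maxAge).UnixMilli() {
		return 0, errTooOld // beyond the wall-clock based max age
	}
	// Pick the side TSDB whose 6h-aligned slot contains the sample;
	// the two side TSDBs alternate slots as time moves forward.
	slot := ts / sideRange.Milliseconds()
	return int(slot%2) + 1, nil
}

func main() {
	now := time.Now()
	mainMinValid := now.Add(-time.Hour).UnixMilli() // main TSDB accepts roughly the last 1h

	idx, err := routeSample(now.Add(-3*time.Hour).UnixMilli(), mainMinValid, now)
	fmt.Println(idx, err) // a 3h-old sample lands in one of the side TSDBs
}
```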

@brancz
Contributor

brancz commented Jul 31, 2020

This can take upto 3x the memory if all the tenants plan to send the old data for the max age (with memory-mapping of head chunk, it's only the number of series that matters the most for the memory).

I don't understand what this sentence means. Does it take 3x memory or is that overhead optimized away by mmapping of head chunks?

@codesome
Contributor

Does it take 3x memory or is that overhead optimized away by mmapping of head chunks?

It can take up to 3x more memory if all 3 TSDBs are in action. With mmapping of the head, the main factor deciding the memory taken by a TSDB is the number of series in the head; the time range does not matter, because only 1 chunk per series plus the labels/index are stored in memory at any point in time, with the remaining chunks on disk.

So in normal usage, when you don't need to ingest old data, there is zero overhead, because no additional TSDB is open. If one more TSDB comes into action to backfill old data and it has a similar set of series to the main TSDB, then it's roughly 2x the memory consumption as before. If the third TSDB also comes into action because the samples span a wider range, then 3x in the worst case. And this is all a per-tenant memory increase: to hit 3x memory usage for the entire ingesters, all tenants would have to be sending old data at the same time (if there's just 1 tenant, the probability increases).

@brancz
Contributor

brancz commented Aug 3, 2020

Understood, thank you for explaining. I expect this happens primarily on extended downtime of the backend, so it's rather likely that this happens for many tenants at once, no? In such a scenario do you think the resulting thundering herd would be manageable?

@itzg

itzg commented Aug 3, 2020

In our case the thundering herd is expected since our tenants will demand that the 6 hours of backfill be eventually available to visualize at some point. After an outage, spinning up larger memory pods seems like a reasonable operation to accommodate this design.

Thanks, @codesome, this is looking very promising.

@codesome
Contributor

In such a scenario do you think the resulting thundering herd would be manageable?

If Cortex is down, ingesting >1h-old samples comes into the picture only when tenants have multiple Prometheus instances and some instances catch up faster than the others, pushing the maxt of the main TSDB ahead. If not, it will be a single TSDB as before.

Considering the worst case of all tenants requiring ingestion of >1h-old samples, if the ingesters are already provisioned for at least 3x the normal memory usage, I think things should be fine in terms of memory (with m-mapping, the memory used will be capped in the absence of queries). But CPU throttling can be expected, driving latency higher and maybe leading to distributors OOMing.

This all assumes an ideal scenario where the user does not have a lot of churn. But from what we have learnt from past outages in the non-blocks version of Cortex, it is advisable to scale up the ingesters (and also the distributors) when the cluster is recovering from an outage, to manage the thundering herd.

@bwplotka
Contributor

Hey, thanks for this. This can work, but in fact what we propose here is, to me, literally multi-TSDB (: So: how is this different from multiple TSDBs capped at T-7h? Is it the block range being 6h?

@pracucci
Contributor Author

We're still working on it

@pracucci pracucci removed the keepalive Skipped by stale bot label Sep 28, 2020
@MedoDome

MedoDome commented Nov 20, 2020

Any information about this? I would like to add some historical data into Cortex

@pracucci
Contributor Author

Any information about this? I would like to add some historical data into Cortex

The approach we took in #3025 should work but has some downsides:

  1. The backfilling code path is exercised only during backfilling, so bugs may slip in and we may not notice them until a backfilling event happens
  2. The ingester memory utilisation could potentially grow 3x, risking hitting the ingester memory limit and causing an outage. Since the backfilling covers the case of recovering after an outage, we may end up with one outage right after another

We're still evaluating pros/cons of #3025, while discussing alternative ideas.

@MedoDome

Thank you for the fast reply, I appreciate it

@bboreham bboreham added the keepalive Skipped by stale bot label Feb 4, 2021
@bboreham
Contributor

Any update after a few months?

@rajsameer

Hey guys, we had a similar issue where, when writing metrics to Cortex for a new tenant, after a few minutes the ingester would start throwing out-of-bounds errors. The reason we found is that we mistakenly did not disable alerts and rules in our Prometheus, and added the same rules and alerts for the tenant in Cortex. Now if the rule gets evaluated by Cortex first, and Cortex then receives the sample from Prometheus, which is also calculating the same rules, the ingester starts throwing out-of-bounds errors, and this happens forever because Prometheus keeps trying to send the same data and Cortex keeps rejecting it. Just to test this hypothesis we disabled rules and alerts on Cortex, and remote write is working fine. Should we add a note in the documentation stating not to have the same rules in both Prometheus and Cortex?

@damnever
Contributor

damnever commented Apr 9, 2021

We ran into this most likely because the distributor got overloaded... which means the middleware (gateway/proxy) and an unstable network may also cause this issue.

@bboreham
Contributor

bboreham commented Apr 9, 2021

Prometheus which is also calculating the same rules

@rajsameer that doesn't sound like the same thing.
Cortex will reject duplicate data with a 400 error, which means Prometheus will drop it.
For rules, Prometheus will re-evaluate the rule a short while later and send new data, which is also rejected as duplicate and dropped.

Happy to receive PRs about how to configure rules, but it's nothing to do with this issue.

If you have some specific details about your "out of bounds" error please open a new issue so those can be understood.

@alexku7
Contributor

alexku7 commented Oct 9, 2021

Hello

Last week we faced this issue: Cortex refused to accept metrics after a long outage, with the error "out of bound".

We had to restart all the Prometheus servers sending their data to Cortex. Only a restart solved this problem.

Just wanted to ask if there is any update regarding this issue. Is there any fix or workaround?

@codesome
Contributor

prometheus/prometheus#8535 (comment) <- which will eventually get into Cortex

@bwplotka
Contributor

Super simple idea: prometheus/prometheus#9607

Quick win, immediate value.

@kojderszymon

Any update after changes in Prometheus - prometheus/prometheus#11075 ?
