
Blocks storage unable to ingest samples older than 1h after an outage #2366


Open
pracucci opened this issue Mar 31, 2020 · 55 comments
Labels
keepalive (Skipped by stale bot), storage/blocks (Blocks storage engine)

Comments

@pracucci
Contributor

TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Given that the head's minimum valid time is also kept at 50% of the block range behind the head's max timestamp, and given the default block range period is 2h, the blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Let's consider this scenario:

  • Multiple Prometheus servers remote-write to the same Cortex tenant
  • Some Prometheus servers stop remote writing to Cortex (for any reason, e.g. a networking issue) and fall behind by more than 1h
  • When those Prometheus servers come back online, Cortex discards any sample whose timestamp is older than 1h, because the max timestamp in the TSDB head is close to "now" (due to the healthy Prometheus servers which never stopped writing series) while the failing ones are trying to catch up by writing samples older than 1h

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.

@bwplotka You may be interested, since I think this issue affects Thanos receive too.

@pracucci
Contributor Author

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?

@pstibrany
Contributor

pstibrany commented Mar 31, 2020

Noting other possible options (not necessarily good options):

  • extend block range (would lead to higher ingester memory usage, and longer WAL replays)
  • change the cutting formula – I was trying to refer to the 50% limit, but it's not related to cutting.

@bwplotka
Contributor

blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does this?

@bwplotka
Contributor

Also, clock skew can cause it.

@pstibrany
Contributor

pstibrany commented Mar 31, 2020

blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does this?

A new block is cut from the head when the head (in-memory data) covers more than 1.5x the block range. For a 2-hour block range, this means the head needs to hold 3h of data before a block is cut. The block's "start time" is always the minimum sample time in the head, while its "end time" is aligned to a block range boundary, and a new block always covers a single "block range" period. Data stored into the block is then removed from the head. That means that after cutting the block, the head still holds at least 1h of data, but possibly more.

When writing new data via the appender, the minimum valid time is computed as Max(minT, maxT - 0.5 * block range), where minT and maxT are the minimum/maximum sample times in the head. This limit is then enforced in the Add and AddFast methods.

When the head covers less than 0.5x the block range (<1h for a 2h block range), samples cannot be older than the min time in the head. When the head covers more than 0.5x the block range (>1h for a 2h block range), samples cannot be older than half the block range behind the max time in the head.
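
To make the limit concrete, here is a minimal sketch in Go (not the actual Prometheus code; timestamps are in milliseconds and the helper name is illustrative):

```go
package main

import "fmt"

const blockRangeMs = int64(2 * 60 * 60 * 1000) // default 2h block range

// minValidTime mirrors the rule described above:
// Max(head minT, head maxT - 0.5 * block range).
func minValidTime(headMinT, headMaxT int64) int64 {
	lower := headMaxT - blockRangeMs/2
	if headMinT > lower {
		return headMinT
	}
	return lower
}

func main() {
	now := int64(10 * 60 * 60 * 1000) // arbitrary "now" at t=10h
	headMinT := now - blockRangeMs    // head currently holds 2h of data
	limit := minValidTime(headMinT, now)

	lateSample := now - 90*60*1000   // a Prometheus server catching up, 90m behind
	fmt.Println(lateSample >= limit) // false: the sample is rejected as "out of bounds"
}
```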

@bwplotka
Contributor

Thanks for the explanation. I think we are getting into the backfilling area.

I think opening a side block and using vertical compaction would solve it.

@bwplotka
Contributor

We could start a Prometheus discussion for it as well, to enable that behavior when vertical compaction is enabled for TSDB itself.

@pracucci
Contributor Author

pracucci commented Mar 31, 2020

I think we are getting into the backfilling area.

It depends. How far back is considered backfilling? From the Cortex (or Thanos receive) perspective, I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

@brancz
Contributor

brancz commented Mar 31, 2020

Maybe it means that remote write is not good enough. In this scenario you only have up to 2 hours for that WAL to even still be around, at which point it would be cut into a block. Maybe we need to discuss back-filling with blocks as a remote write extension. Just thinking out loud. (I think I briefly discussed this with @csmarchbanks once before)

@bwplotka
Contributor

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

@pracucci
Contributor Author

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

From the pure UX side, if it's about a Prometheus server catching up after an outage then I wouldn't consider it backfilling.

@bwplotka
Contributor

I guess the boundary is then 1.5x block size (once we exceed WAL (3h))?

@bwplotka
Contributor

Actually the user can change that, so we could even have a 2-year WAL. If someone doesn't upload anything for 2 years and then suddenly wants to push a 2-year-old sample, would that still not be backfilling 🤔?

@brancz
Contributor

brancz commented Mar 31, 2020

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

@bwplotka
Contributor

bwplotka commented Mar 31, 2020

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

What do you mean? When would you cut then? You don't know upfront what writes you expect to see but don't see, no?

@brancz
Contributor

brancz commented Mar 31, 2020

The heuristic of allowing inserts up to 0.5x the timespan of the head block is based on the assumption that we can safely and correctly cut blocks at that time; I'm wondering what other strategies there might be. Clearly other databases do different things, and time-based rules are actually kind of weird in the first place. What I'm trying to say is: if we remove that requirement, then we might be able to think of ways to improve this situation (potentially combined with vertical compaction?).

@csmarchbanks
Contributor

Maybe we need to discuss back-filling with blocks as a remote write extension

I think I have some code sitting around somewhere that does this (I was using it to populate datasets from Prometheus into various backends that supported remote write). If there is interest I'd be happy to dig it up again.

we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic

Yes, that would be great. There were some ideas around this when we were discussing how to limit Prometheus memory usage, weren't there? I remember at least something around a space-based head block.

@codesome
Contributor

codesome commented Apr 6, 2020

Catching up with emails now :) looks like I missed some discussions

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?

While "out of bound" in TSDB would work fine, it needs some more discussion if it has to be upstream. Also, talking w.r.t. cortex, you will have an unexpected rise in memory consumption because Head block gets bigger than expected. (Additionally, vertical queries and compactions are a tad bit more expensive in terms of CPU and Memory)

I think opening a side block and using vertical compaction would solve it.

Is this idea for upstream Prometheus or Thanos/Cortex? But anyway, do we have any requirement that data is available for querying soon after ingesting?

extend block range (would lead to higher ingester memory usage, and longer WAL replays)

With the m-mapping work that is going on, the memory usage can be taken care of. And if the partial-chunks work (a follow-up to the m-map work) looks good to maintainers, that would also take care of WAL replays :). This would let Cortex increase its block range, but the default in upstream Prometheus would need to be changed too so that the WAL is kept around longer.

@codesome
Contributor

codesome commented Apr 6, 2020

I would rather try to avoid adding samples older than the Head's mint and bringing vertical compaction into play in upstream Prometheus, because (1) the code is already complex enough (is that a valid argument? :P), (2) if not used correctly, users will silently lose/corrupt data, and (3) unexpected spikes in CPU/memory (maybe these should be expected?).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

@pracucci
Contributor Author

pracucci commented Apr 6, 2020

Also, with respect to Cortex, you will see an unexpected rise in memory consumption because the Head block gets bigger than expected.

Is this true even after the m-mapping work is complete?

I think opening a side block and using vertical compaction would solve it.
But anyway, do we have any requirement that data is available for querying soon after ingesting?

Yes, we should be able to immediately query back samples pushed to the ingesters, like it's working for the chunks storage (this issue affects only the blocks storage).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

I was thinking about an optional flag (like overlapping data).
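
For illustration only, here is a rough sketch of what such per-tenant settings could look like; the field names are hypothetical (only the overlapping-blocks option exists in TSDB, the out-of-bound one is the flag being discussed here), so treat this as the shape of the idea rather than an actual API:

```go
package main

import "time"

// tsdbConfig sketches per-tenant TSDB settings Cortex could expose.
// AllowOverlappingBlocks mirrors the existing TSDB option for overlapping
// data; AllowOutOfBoundSamples is hypothetical and does not exist upstream.
type tsdbConfig struct {
	BlockRange             time.Duration // e.g. 2h, the default block range
	AllowOverlappingBlocks bool          // let vertical compaction/querying merge overlaps
	AllowOutOfBoundSamples bool          // hypothetical: skip the maxT - blockRange/2 check
}

func defaultTenantConfig() tsdbConfig {
	return tsdbConfig{
		BlockRange:             2 * time.Hour,
		AllowOverlappingBlocks: true,
		AllowOutOfBoundSamples: true,
	}
}

func main() {
	_ = defaultTenantConfig()
}
```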

@stale

stale bot commented Jun 5, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 5, 2020
@pracucci pracucci added the keepalive Skipped by stale bot label Jun 5, 2020
@stale stale bot removed the stale label Jun 5, 2020
@pracucci
Contributor Author

pracucci commented Jun 5, 2020

This is still valid. @codesome has some ideas he wants to experiment with.

@bwplotka
Contributor

bwplotka commented Jun 5, 2020 via email

@codesome
Contributor

codesome commented Jun 5, 2020

Can you share those ideas Ganesh?

It was one of my TODOs for today :) I will share it with you all once I have written it down.

@codesome
Contributor

codesome commented Jun 5, 2020

This is an idea off the top of my head and needs more thought.

The root problem

These 2 checks, this and this

Solution

Enable vertical compaction and the vertical querier, and make that check optional - upstream will keep the check enabled, whereas Cortex and Thanos can turn it off.

Why does this work?

Because we are not discarding the out-of-bound sample; it just gets added to its series. Out-of-order samples within a series are still discarded.

If a series lagging behind in time causes overlap with data on disk, the vertical querier will take care of deduping.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Any gotchas? Yes

  • Head compaction is based on the mint and maxt of the Head. One could generate synthetic data adding just 2 samples per series ranging over 3h and cause lots of unnecessary Head compactions. This can be taken care of by manually calling Compact() with some logic around it instead of calling it very often.

  • It is possible to add samples spanning a very large time range - this can end up compacting all blocks into one. One possible solution: instead of removing the out-of-bound check entirely, make the minValidTime configurable so that you can decide how far back in time you want to accept samples (see the sketch below).
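
A minimal sketch of that configurable limit, assuming a wall-clock based look-back window; the parameter name is illustrative and not an existing TSDB option:

```go
package main

import (
	"fmt"
	"time"
)

// minValidTime sketches the suggestion above: instead of dropping the
// out-of-bound check entirely, accept samples no older than a configurable
// window behind the wall clock. outOfBoundWindow is an illustrative name.
func minValidTime(outOfBoundWindow time.Duration) int64 {
	return time.Now().Add(-outOfBoundWindow).UnixMilli()
}

func main() {
	limit := minValidTime(6 * time.Hour) // e.g. accept samples up to 6h old

	sampleT := time.Now().Add(-3 * time.Hour).UnixMilli()
	fmt.Println(sampleT >= limit) // true: a 3h-old sample is accepted
}
```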

@pracucci
Contributor Author

pracucci commented Jun 8, 2020

Thanks @codesome for sharing your idea!

As you stated, a downside of this approach is that you may end up with a single large block, which will impact compaction. This is a solvable problem: we could replace the TSDB compactor planner with a custom one (e.g. exclude the non-aligned blocks from compaction, or compact them together without impacting correctly-aligned blocks).

The longer the time range of such blocks, the more problematic they are at query time, so, as you stated, we may add a limit on the oldest timestamp we allow to ingest (now - threshold). In practice this would make the system unsuitable for backfilling (because the threshold would be in terms of hours, not days/months) but it may solve the problem described in this issue.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Ideally we don't want any vertical compaction to occur in the ingesters; vertical compaction will be done by the Cortex compactor later on. What do you think?

@codesome
Contributor

The above proposal of having multiple TSDBs open for backfilling was discarded because of the unnecessary complexity it would bring into the code and the additional memory required in the worst case (for example, if you want to backfill up to 6h-old samples, you might need up to 6-7x the memory).

With the scope of this being only to recover from a long outage on either the Prometheus side or the Cortex side, and not to actually backfill old data, here is a simplified design:

Taking the example of allowing samples up to 7h old (this will be configurable): 1h is taken care of by the main TSDB, and for the remaining 6h we maintain only 2 additional TSDBs.

Both new TSDBs will be able to ingest 6h of data (this is done by setting their block range to 12h). The time ranges allowed in those blocks will be aligned to 6h (or the configured time).

Here is some visualization (M is the current TSDB. X and Y are the newly introduced TSDBs):

1.
                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
                   |------------|
                   |      X     |
                   |------------|

2. Time is moving, both X and Y will cover that 6h partially.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
            |------------|------------|
            |      X     |      Y     |
            |------------|------------|

3. The first one goes beyond the configured max age, hence we compact and ship its block.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
      |------------|------------|
      |      X     |      Y     |
      |------------|------------|

4. The TSDBs are rotated at this point, and the same thing continues.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
            |------------|------------|
            |      Y     |      X     |
            |------------|------------|

We have two of them because when one is compacting, the other should be up. And as time moves, together they cover that 6h gap (a rough routing sketch follows after the lists below).

Small implementation details:

  1. The new TSDBs are created only if a sample is received for those time ranges and it was not deemed out-of-order by the main TSDB.
  2. The 7h max age is calculated w.r.t. the wall clock and not the max time of the main TSDB; this avoids a sudden advancement of that maxt, which could make both old TSDBs unavailable since they would need to be compacted before they can ingest new data.
  3. Shutdown, flush, the shutdown handler, etc., on these new TSDBs will behave similarly to the main TSDB.
  4. Transfers during handover won't be done for these new TSDBs (moreover, handover might be entirely removed for blocks: Proposal: Remove support for TSDB blocks transfer between ingesters #2966)

Resource usage:

  1. This can take up to 3x the memory if all the tenants send old data up to the max age (with memory-mapping of head chunks, it's mostly the number of series that matters for memory).
  2. It needs more disk space to support the new TSDBs. Exact/rough numbers are yet to be calculated for this.
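
To make the routing concrete, here is a hedged sketch with illustrative names (index 0 is the main TSDB M, 1 and 2 are the rotating side TSDBs X and Y); this is not actual Cortex code, just the shape of the idea:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	sideRange = 6 * time.Hour // range covered by each side TSDB (configurable)
	maxAge    = 7 * time.Hour // oldest accepted sample, relative to the wall clock
)

var errTooOld = errors.New("sample older than configured max age")

// routeSample picks a TSDB for a sample: index 0 is the main TSDB (M),
// 1 and 2 are the two rotating side TSDBs (X and Y). mainMinValid is the
// main TSDB's min valid time (its maxT - blockRange/2).
func routeSample(ts, mainMinValid int64, now time.Time) (int, error) {
	if ts >= mainMinValid {
		return 0, nil // normal path: main TSDB
	}
	if ts < now.Add(-maxAge).UnixMilli() {
		return 0, errTooOld // beyond the wall-clock based max age
	}
	// Pick the side TSDB whose 6h-aligned slot contains the sample;
	// the two side TSDBs alternate slots as time moves forward.
	slot := ts / sideRange.Milliseconds()
	return int(slot%2) + 1, nil
}

func main() {
	now := time.Now()
	mainMinValid := now.Add(-time.Hour).UnixMilli() // main TSDB accepts roughly the last 1h

	idx, err := routeSample(now.Add(-3*time.Hour).UnixMilli(), mainMinValid, now)
	fmt.Println(idx, err) // a 3h-old sample lands in one of the side TSDBs
}
```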

@brancz
Contributor

brancz commented Jul 31, 2020

This can take upto 3x the memory if all the tenants plan to send the old data for the max age (with memory-mapping of head chunk, it's only the number of series that matters the most for the memory).

I don't understand what this sentence means. Does it take 3x memory or is that overhead optimized away by mmapping of head chunks?

@codesome
Contributor

Does it take 3x memory or is that overhead optimized away by mmapping of head chunks?

It can take up to 3x more memory if all 3 TSDBs are in action. With mmapping of the head, the main factor deciding the memory taken by a TSDB is the number of series in the head; the time range does not matter, because only 1 chunk per series plus the labels/index are stored in memory at any point in time, with the remaining chunks on disk.

So in normal usage, when you don't need to ingest old data, there is zero overhead, because no additional TSDB is open. If one more TSDB comes into action to backfill old data and it has a similar set of series to the main TSDB, then it's roughly 2x the memory consumption as before. If the third TSDB also comes into action because the samples span a wider range, then 3x in the worst case. And this is all a per-tenant memory increase: to hit 3x memory usage for the entire ingesters, all tenants would have to be sending old data at the same time (if there's just 1 tenant, the probability increases).

@brancz
Contributor

brancz commented Aug 3, 2020

Understood, thank you for explaining. I expect this happens primarily on extended downtime of the backend, so it's rather likely that this happens for many tenants at once, no? In such a scenario do you think the resulting thundering herd would be manageable?

@itzg

itzg commented Aug 3, 2020

In our case the thundering herd is expected since our tenants will demand that the 6 hours of backfill be eventually available to visualize at some point. After an outage, spinning up larger memory pods seems like a reasonable operation to accommodate this design.

Thanks, @codesome, this is looking very promising.

@codesome
Contributor

In such a scenario do you think the resulting thundering herd would be manageable?

If Cortex is down, ingesting >1h-old samples comes into the picture only when tenants have multiple Prometheus instances and some instances catch up faster than the others, pushing the maxt of the main TSDB ahead. If not, it will be a single TSDB as before.

Considering the worst case of all tenants requiring ingestion of >1h-old samples, if the ingesters are already provisioned for at least 3x the normal memory usage, I think things should be fine in terms of memory (with m-mapping, the memory used will be capped in the absence of queries). But CPU throttling can be expected, driving latency higher and maybe leading to distributors OOMing.

This all assumes an ideal scenario where the user does not have a lot of churn. But from what we have learnt from past outages in the non-blocks version of Cortex, it is advisable to scale up the ingesters (and also the distributors) when the cluster is recovering from an outage, to manage the thundering herd.

@bwplotka
Contributor

Hey, thanks for this. This can work, but in fact what we propose here is, to me, literally multi-TSDB (: So: how is this different from multiple TSDBs capped at T-7h? Is it the block range being 6h?

@pracucci
Contributor Author

We're still working on it

@pracucci pracucci removed the keepalive Skipped by stale bot label Sep 28, 2020
@MedoDome

MedoDome commented Nov 20, 2020

Any information about this? I would like to add some historical data into Cortex

@pracucci
Contributor Author

Any information about this? I would like to add some historical data into Cortex

The approach we took in #3025 should work but has some downsides:

  1. The backfilling code path is exercised only during backfilling, so bugs may slip in and we may not notice them until a backfilling event happens
  2. The ingester memory utilisation could potentially grow 3x, risking hitting the ingester memory limit and causing an outage. Since the backfilling covers the case of recovering after an outage, we may end up with one outage right after another

We're still evaluating pros/cons of #3025, while discussing alternative ideas.

@MedoDome

Thank you for the fast reply, I appreciate it

@bboreham bboreham added the keepalive Skipped by stale bot label Feb 4, 2021
@bboreham
Contributor

Any update after a few months?

@rajsameer

Hey guys, we had a similar issue where, when writing metrics to Cortex for a new tenant, after a few minutes the ingester would start throwing out-of-bounds errors. The reason we found is that we mistakenly did not disable alerts and rules in our Prometheus, and added the same rules and alerts for the tenant in Cortex. Now if the rule gets evaluated by Cortex first, and Cortex then receives the sample from Prometheus, which is also calculating the same rules, the ingester starts throwing out-of-bounds errors, and this happens forever because Prometheus keeps trying to send the same data and Cortex keeps rejecting it. Just to test this hypothesis we disabled rules and alerts on Cortex, and remote write is working fine. Should we add a note in the documentation stating not to have the same rules in both Prometheus and Cortex?

@damnever
Contributor

damnever commented Apr 9, 2021

We ran into this most likely because the distributor got overloaded... which means the middleware (gateway/proxy) and an unstable network may also cause this issue.

@bboreham
Contributor

bboreham commented Apr 9, 2021

Prometheus which is also calculating the same rules

@rajsameer that doesn't sound like the same thing.
Cortex will reject duplicate data with a 400 error, which means Prometheus will drop it.
For rules, Prometheus will re-evaluate the rule a short while later and send new data, which is also rejected as duplicate and dropped.

Happy to receive PRs about how to configure rules, but it's nothing to do with this issue.

If you have some specific details about your "out of bounds" error please open a new issue so those can be understood.

@alexku7
Contributor

alexku7 commented Oct 9, 2021

Hello

Last week we faced this issue: Cortex refused to accept metrics after a long outage, with the error "out of bound".

We had to restart all the Prometheus servers sending their data to Cortex. Only a restart solved this problem.

Just wanted to ask if there is any update regarding this issue. Is there any fix or workaround?

@codesome
Contributor

prometheus/prometheus#8535 (comment) <- which will eventually get into Cortex

@bwplotka
Contributor

Super simple idea: prometheus/prometheus#9607

Quick win, immediate value.

@kojderszymon

Any update after changes in Prometheus - prometheus/prometheus#11075 ?
