Blocks storage unable to ingest samples older than 1h after an outage #2366
@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?
Noting other possible options (not necessarily good options):
Really? Can we find a code path in Prometheus which does it?
Also, clock skew can cause it.
A new block is cut from the head when the head (in-memory data) covers more than 1.5x of the block range. For a 2h block range, this means the head needs to have 3h of data to cut a block. The block "start time" is always the minimum sample time in the head, while the block "end time" is aligned on a block range boundary. A new block always covers a single "block range" period. Data stored into a block is then removed from the head, which means that after cutting the block, the head will still have at least 1h of data, but possibly even more. When writing new data via the appender, the minimum time limit is computed as follows: when the head covers less than 0.5 of the block range (<1h for a 2h block range), samples cannot be older than the min time in the head; when the head covers more than 0.5 of the block range (>1h for a 2h block range), samples cannot be older than half a block range before the max time in the head.
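A minimal sketch of the limit described above, using illustrative names (this is not the actual TSDB code):

```go
package main

import "fmt"

// minValidTime returns the oldest sample timestamp (ms) the head still accepts,
// per the rule above: samples must not be older than the head's min time, nor
// older than half a block range before the head's max time.
func minValidTime(headMinTime, headMaxTime, blockRange int64) int64 {
	limit := headMaxTime - blockRange/2
	if limit < headMinTime {
		return headMinTime
	}
	return limit
}

func main() {
	const blockRange = int64(2 * 60 * 60 * 1000) // default 2h block range, in ms
	// Head spanning 3h (0h..3h): anything older than the 2h mark is "out of bounds".
	fmt.Println(minValidTime(0, 3*60*60*1000, blockRange)) // prints 7200000
}
```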
Thanks for the explanation. I think we are getting into the backfilling area. I think opening a side block and using vertical compaction would solve it.
We could start a Prometheus discussion for it as well, to enable that behavior if vertical compaction is enabled for TSDB itself.
It depends. How far back is considered backfilling? From the Cortex (or Thanos receive) perspective, I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.
Maybe it means that remote write is not good enough. You only have up to 2 hours to even still have that WAL around in this scenario, at which point it would be cut to a block. Maybe we need to discuss back-filling with blocks as a remote write extension. Just thinking out loud. (I think I briefly discussed this with @csmarchbanks once before)
Where is the boundary?
From the pure UX side, if it's about a Prometheus server catching up after an outage then I wouldn't consider it backfilling.
I guess the boundary is then 1.5x block size (once we exceed the WAL (3h))?
Actually, the user can change that, so we could have even a 2-year WAL. If someone does not upload anything for 2 years and then suddenly wants to push a 2-year-old sample, would that still not be backfilling 🤔?
Sounds like concretely for this, we need a way to safely cut a head block that doesn't have the 1.5x time requirement/heuristic.
What do you mean? When would you cut then? You don't know upfront what writes you expect to see but don't see, no?
The heuristic of allowing inserts up to 0.5x the timespan of head blocks is based on the assumption that we can safely and correctly cut blocks at that time; I'm wondering what other strategies there might be. Clearly other databases do different things, and time-based things are actually kind of weird in the first place. What I'm trying to say is, if we remove that requirement, then we might be able to think of ways to improve this situation (potentially combined with vertical compaction?).
I think I have some code sitting around somewhere that does this (I was using it to populate datasets from Prometheus into various backends that supported remote write). If there is interest, I'd be happy to dig it up again.
Yes, that would be great. There were some ideas around this when we were discussing how to limit Prometheus memory usage, weren't there? I remember at least something around a space-based head block.
Catching up with emails now :) looks like I missed some discussions
While "out of bound" in TSDB would work fine, it needs some more discussion if it has to be upstream. Also, talking w.r.t. cortex, you will have an unexpected rise in memory consumption because Head block gets bigger than expected. (Additionally, vertical queries and compactions are a tad bit more expensive in terms of CPU and Memory)
Is this idea for upstream Prometheus or Thanos/Cortex? But anyway, do we have any requirement that data is available for querying soon after ingesting?
With the m-mapping work that is going on, the memory usage can be taken care of. And if the partial chunks work looks good to the maintainers (a follow-up of the m-map work), that would also take care of WAL replays :). But this would mean Cortex can increase its block range; the default in upstream Prometheus would need to be changed too, so that the WAL is kept around longer.
I would very much try to avoid adding samples older than the Head mint and bringing vertical compaction into play in upstream Prometheus, because (1) the code is already complex enough (is that a valid argument? :P), (2) if not used correctly, users will silently lose/corrupt data, and (3) unexpected spikes in CPU/memory (maybe this should be expected?). If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.
Is this true even after the m-mapping work is complete?
Yes, we should be able to immediately query back samples pushed to the ingesters, like it works for the chunks storage (this issue affects only the blocks storage).
I was thinking about an optional flag (like overlapping data).
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
This is still valid. @codesome has some ideas he wants to experiment with.
Can you share those ideas, Ganesh?
It was one of my TODOs for today :) I will share with you all once I have written it down.
This is an idea from the top of my head and needs more thought.
The root problem: samples older than the Head's min valid time are rejected as "out of bound".
Solution: enable vertical compaction and the vertical querier, and make that check optional - upstream will enable the check, whereas Cortex and Thanos can turn it off.
Why does this work? Because we are not discarding out-of-bound samples; the sample just gets added to its series. Out-of-order samples within a series are still discarded. If a series lagging behind in time causes an overlap with data on disk, the vertical querier will take care of deduping. After compaction of the Head, the vertical blocks are given top priority and they will get merged.
Any gotchas? Yes
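A rough sketch of what making that check optional could look like; the names (`appendCheck`, `allowOutOfBounds`) are hypothetical and only illustrate the idea, not the real TSDB appender:

```go
package sketch

import "errors"

var (
	errOutOfBounds = errors.New("out of bounds")
	errOutOfOrder  = errors.New("out of order sample")
)

// appendCheck is a hypothetical stand-in for the head appender's validation:
// the out-of-bounds check becomes optional, while out-of-order samples within
// a single series are still rejected.
type appendCheck struct {
	allowOutOfBounds bool  // upstream: false; Cortex/Thanos with vertical compaction: true
	minValidTime     int64 // oldest timestamp the head normally accepts
}

func (c appendCheck) appendable(seriesLastTimestamp, t int64) error {
	if !c.allowOutOfBounds && t < c.minValidTime {
		return errOutOfBounds
	}
	if t < seriesLastTimestamp {
		return errOutOfOrder
	}
	return nil
}
```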
Thanks @codesome for sharing your idea! As you stated, a downside of this approach is that you may end up with a single large block which will impact the compaction. This is a solvable problem by replacing the TSDB compactor planner with a custom one (i.e. we could exclude those non-aligned blocks from compaction, or compact them together without impacting correctly-aligned blocks). The longer the time range of such blocks, the more problematic it is at query time, so, as you stated, we may add a limit on the oldest timestamp we allow to ingest (now - threshold). This would practically make this system not work for backfilling purposes (because the threshold would be in terms of hours, not days/months), but it may solve the problem described in this issue.
Ideally we don't want any vertical compaction to occur in the ingesters. Vertical compaction will be done by the Cortex compactor later on. What do you think?
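A hedged sketch of the custom-planner idea, assuming "non-aligned" simply means a block spilling over a range-aligned window; `blockMeta` and `splitByAlignment` are illustrative names, not the real Cortex/TSDB interfaces:

```go
package sketch

// blockMeta is an illustrative stand-in for block metadata.
type blockMeta struct {
	MinTime, MaxTime int64 // ms
}

// splitByAlignment separates blocks that fit inside a single range-aligned
// window from the non-aligned (e.g. backfilled/overlapping) ones, so a custom
// planner could compact the two groups independently.
func splitByAlignment(blocks []blockMeta, blockRange int64) (aligned, notAligned []blockMeta) {
	for _, b := range blocks {
		windowStart := (b.MinTime / blockRange) * blockRange
		if b.MaxTime <= windowStart+blockRange {
			aligned = append(aligned, b)
		} else {
			notAligned = append(notAligned, b)
		}
	}
	return aligned, notAligned
}
```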
The above proposal of having multiple TSDBs open for backfilling was discarded because of the unnecessary complexity it brings into the code and the additional memory required in the worst case (for example, if you want to backfill up to 6h old samples, you might need up to 6-7x the memory). With the scope of this being only to recover from a long outage either on the Prometheus side or the Cortex side, and not actually backfilling old data, here is a simplified design: taking an example of allowing samples up to 7h old (this will be configurable), 1h is taken care of by the main TSDB, and for the remaining 6h we maintain only 2 additional TSDBs. Both the new TSDBs will be able to ingest 6h of data (this is done by keeping the block range at 12h). The time ranges allowed in those blocks will be aligned with 6h (or the configured time). Here is some visualization (M is the current TSDB; X and Y are the newly introduced TSDBs):
We have two of them because when one is compacting, the other should be up, and as time moves on, both together cover that 6h gap. Small implementation details:
Resource:
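A rough sketch of how a sample could be routed under this design (M = main TSDB, X and Y = the two additional TSDBs); `tsdbHandle` and `pickTSDB` are hypothetical names used only to illustrate the routing:

```go
package sketch

import "errors"

// tsdbHandle is a stand-in for an open per-tenant TSDB and the sample
// timestamps it accepts; names and bounds are purely illustrative.
type tsdbHandle struct {
	name             string
	minTime, maxTime int64 // accepted timestamps, in ms
}

var errTooOld = errors.New("sample older than the configured backfill window")

// pickTSDB routes a sample by timestamp: the main TSDB (M) keeps handling the
// most recent data as today, while X and Y together cover the older backfill gap.
func pickTSDB(m, x, y tsdbHandle, t int64) (tsdbHandle, error) {
	switch {
	case t >= m.minTime:
		return m, nil
	case t >= x.minTime && t <= x.maxTime:
		return x, nil
	case t >= y.minTime && t <= y.maxTime:
		return y, nil
	default:
		return tsdbHandle{}, errTooOld
	}
}
```

With a 7h window, M would keep accepting roughly the most recent 1h as it does today, while X and Y slide forward over the older 6h as described above.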
I don't understand what this sentence means. Does it take 3x the memory, or is that overhead optimized away by m-mapping of head chunks?
It can take up to 3x more memory if all 3 TSDBs are in action. With m-mapping of the head, the main factor that decides the memory taken by a TSDB is the number of series in the head; the time range does not matter, because only 1 chunk per series plus the labels/index are stored in memory at any point in time, with the remaining chunks on disk. So in normal usage, when you don't need to ingest old data, there is 0 overhead, because no additional TSDB is open. If one more TSDB comes into action to backfill old data, and it has a similar set of series to that of the main TSDB, then it's roughly 2x the memory consumption as before. If the third TSDB also comes into action because the samples span a wider range, then 3x in the worst case. And this is all a per-tenant memory increase, so to hit 3x memory usage for the entire set of ingesters, all tenants would have to be sending old data at the same time (if there is just 1 tenant, the probability of that increases).
Understood, thank you for explaining. I expect that this happens primarily on extended downtime of the backend, so it's rather likely that this happens for many tenants at once, no? In such a scenario do you think the resulting thundering herd would be manageable?
In our case the thundering herd is expected, since our tenants will demand that the 6 hours of backfill eventually be available to visualize at some point. After an outage, spinning up larger-memory pods seems like a reasonable operation to accommodate this design. Thanks, @codesome, this is looking very promising.
If Cortex is down, ingesting >1h old samples will come into the picture only when tenants have multiple Prometheus instances and some instances catch up faster than the others, hence pushing the maxt of the main TSDB ahead. If not, it will be a single TSDB as before. Considering the worst case of all tenants requiring ingestion of >1h old samples, if the ingesters are already provisioned to use at least 3x the normal memory usage, I think things should be fine in terms of memory (with m-mapping, the memory used will be capped in the absence of queries). But CPU throttling can be expected, driving the latency higher and maybe leading to distributors OOMing. This is all considering an ideal scenario where the user does not have a lot of churn. But from what we have learnt from past outages in the non-blocks version of Cortex, it is advisable to scale up the ingesters (and also the distributors) when the cluster is recovering from an outage, to manage the thundering herd.
Hey, thanks for this. This can work, but in fact, what we propose here is, to me, literally multi-TSDB (: So: how is this different from multiple TSDBs capped at T-7h? Is it the block range being 6h?
We're still working on it.
Any information about this? I would like to add some historical data into Cortex.
The approach we took in #3025 should work but has some downsides:
We're still evaluating pros/cons of #3025, while discussing alternative ideas.
Thank you for the fast reply, I appreciate it.
Any update after a few months?
Hey guys, we had a similar issue where, when writing metrics to Cortex for a new tenant, after a few minutes the ingester would start throwing an "out of bound" error. The reason we found for this is that we mistakenly did not disable alerts and rules in our Prometheus and added the same rules and alerts for the tenant in Cortex. Now, if the alert gets evaluated by Cortex first and Cortex then receives the sample from Prometheus, which is also evaluating the same rules, the ingester starts throwing an "out of bound" error, and this will happen forever because Prometheus will keep trying to send the same data and Cortex will reject it. So, just to test this hypothesis, we disabled rules and alerts on Cortex and remote write is working fine. Should we add a note in the documentation stating not to have the same rules in both Prometheus and Cortex?
We ran into this most likely because the distributor gets overloaded... this means the middlewares (gateway/proxy) and an unstable network may cause this issue.
@rajsameer that doesn't sound like the same thing. Happy to receive PRs about how to configure rules, but it has nothing to do with this issue. If you have some specific details about your "out of bounds" error, please open a new issue so those can be understood.
Hello. Last week we faced this issue: Cortex refused to accept metrics after a long outage, with the error "out of bound". We had to restart all the Prometheus servers sending their data to Cortex; only a restart solved this problem. Just wanted to ask if there is any update regarding this issue. Is there any fix or workaround?
prometheus/prometheus#8535 (comment) <- which will eventually get into Cortex
Super simple idea: prometheus/prometheus#9607. Quick win, immediate value.
Any update after changes in Prometheus - prometheus/prometheus#11075 ? |
TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Given a block is cut from the head up until 50% of the block range before the max timestamp within the head, and given the default block range period is 2h, this means that the blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.
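For example (a small sketch with made-up timestamps, assuming the default 2h block range):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative timestamps only: with the default 2h block range, the head
	// rejects anything older than (most recent timestamp in the head - 1h).
	headMaxTime := time.Date(2020, 3, 30, 12, 0, 0, 0, time.UTC)
	minValid := headMaxTime.Add(-1 * time.Hour) // half of the 2h block range

	// A sample buffered by Prometheus during a 90-minute outage:
	sample := time.Date(2020, 3, 30, 10, 30, 0, 0, time.UTC)
	fmt.Println(sample.Before(minValid)) // true -> rejected as "out of bounds"
}
```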
Let's consider this scenario:
We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.
@bwplotka You may be interested, given I think this issue affects Thanos receive too.