Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered in the ring and its state switches to ACTIVE, it suddenly receives a large number of new series. If you target each ingester to hold about 1.5M active series, it has to add 1.5M series to its TSDB in a matter of a few seconds.
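For context on why the burst is so abrupt: a newly ACTIVE instance immediately owns its share of the ring's keyspace, and because its TSDB is empty, every series in that share is brand new to it. Below is a minimal, self-contained sketch of that ownership shift using a simplified token ring (this is an illustration, not the actual Cortex ring code, which also handles replication factor, zone awareness, etc.):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// ring is a toy token ring: each ingester registers a set of random tokens,
// and a series is owned by the ingester holding the first token >= hash(series).
type ring struct {
	tokens []uint32
	owner  map[uint32]int
}

func buildRing(numIngesters, tokensPer int, seed int64) *ring {
	rng := rand.New(rand.NewSource(seed))
	r := &ring{owner: map[uint32]int{}}
	for ing := 0; ing < numIngesters; ing++ {
		for t := 0; t < tokensPer; t++ {
			tok := rng.Uint32()
			r.tokens = append(r.tokens, tok)
			r.owner[tok] = ing
		}
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

func (r *ring) get(h uint32) int {
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.tokens[i]]
}

func hashSeries(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	// 51st ingester (index 50) joins a ring of 50: its tokens carve out
	// roughly 1/51 of the keyspace, and it takes ownership of every series
	// hashing into those ranges the moment it turns ACTIVE.
	after := buildRing(51, 128, 1)

	const totalSeries = 5_000_000 // scaled down; the real target was ~1.5M active series per ingester
	landedOnNew := 0
	for i := 0; i < totalSeries; i++ {
		if after.get(hashSeries(fmt.Sprintf("series-%d", i))) == 50 {
			landedOnNew++
		}
	}
	fmt.Printf("series instantly owned by the new ingester: %d (%.1f%%)\n",
		landedOnNew, 100*float64(landedOnNew)/totalSeries)
}
```

All of those series have to be created in the new ingester's Head within the distributors' retry window, which is what produces the creation storm described above.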
Today, while scaling out by a large number of ingesters (50), a few of them experienced very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory to grow until some of these ingesters were OOMKilled.
I've been able to profile the affected ingesters, and the following is what I've found so far:
1. Number of in-flight push requests skyrockets right after ingester startup
2. The number of TSDB appenders skyrockets too
3. Average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too
4. Lock contention in Head.getOrCreateWithID()
Not surprisingly, looking at the number of active goroutines, 99.9% were blocked in Head.getOrCreateWithID() due to lock contention.
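For illustration, here is a minimal sketch of the contention pattern (not the actual Prometheus TSDB code, which shards series across lock stripes): the read-locked fast path only helps once a series already exists, so on a cold, empty Head every push goroutine funnels into the exclusive series-creation path.

```go
package main

import (
	"fmt"
	"sync"
)

// memSeries stands in for the per-series state the real Head keeps.
type memSeries struct{ ref uint64 }

// head is a toy stand-in for the TSDB Head: a series map guarded by one RWMutex.
type head struct {
	mtx     sync.RWMutex
	series  map[string]*memSeries
	nextRef uint64
}

// getOrCreate mimics the get-or-create pattern: a cheap read-locked lookup for
// series that already exist, and an exclusive write-locked section to create
// the ones that don't. On a freshly started ingester every series is missing,
// so all push goroutines serialize on the write lock.
func (h *head) getOrCreate(labels string) *memSeries {
	h.mtx.RLock()
	s := h.series[labels]
	h.mtx.RUnlock()
	if s != nil {
		return s // warm path: concurrent readers, no contention
	}

	h.mtx.Lock()
	defer h.mtx.Unlock()
	if s := h.series[labels]; s != nil {
		return s // created by a concurrent request while we waited for the lock
	}
	h.nextRef++
	s = &memSeries{ref: h.nextRef}
	h.series[labels] = s
	return s
}

func main() {
	h := &head{series: map[string]*memSeries{}}

	// Simulate many concurrent push requests, each appending samples for
	// series the ingester has never seen before (the cold-start case).
	var wg sync.WaitGroup
	for req := 0; req < 200; req++ {
		wg.Add(1)
		go func(req int) {
			defer wg.Done()
			for i := 0; i < 5000; i++ {
				h.getOrCreate(fmt.Sprintf("req%d_series%d", req, i))
			}
		}(req)
	}
	wg.Wait()
	fmt.Println("series created:", len(h.series))
}
```

While the creation path is serialized, incoming pushes keep queuing up, which matches the in-flight request and memory growth observed above.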
To Reproduce
I haven't found a way to easily reproduce it locally or with a stress test yet, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).
Storage Engine
- [x] Blocks
- [ ] Chunks
We now have -ingester.instance-limits.max-inflight-push-requests, which allows the in-flight requests to be capped and avoids the OOM; however, it will still create a lot of noise from error messages and retried requests.
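To make that trade-off concrete, here is a minimal sketch of how such an in-flight cap behaves, assuming a simple atomic counter (an illustration of the idea, not Cortex's actual implementation of the flag): requests over the limit fail fast and get retried rather than accumulating in memory.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyInflight = errors.New("too many in-flight push requests")

// pushLimiter is a rough stand-in for an instance-level in-flight cap: count
// the requests currently in flight and reject the ones over the limit instead
// of letting them pile up in memory.
type pushLimiter struct {
	inflight atomic.Int64
	max      int64 // e.g. the value passed to -ingester.instance-limits.max-inflight-push-requests
}

func (l *pushLimiter) push(appendSamples func() error) error {
	n := l.inflight.Add(1)
	defer l.inflight.Add(-1)
	if l.max > 0 && n > l.max {
		// Rejected immediately: the caller logs an error and retries (the
		// "noise" mentioned above), but the ingester's memory stays bounded.
		return errTooManyInflight
	}
	return appendSamples()
}

func main() {
	l := &pushLimiter{max: 2}
	l.inflight.Store(2) // pretend two pushes are already being appended
	fmt.Println(l.push(func() error { return nil }))
}
```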