Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered in the ring and its state switches to ACTIVE, it suddenly receives a large number of new series. If you target each ingester to hold about 1.5M active series, it has to add 1.5M series to its TSDB in a matter of a few seconds.
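For context on why the burst is so abrupt: a newly ACTIVE instance immediately owns its share of the ring's keyspace, and because its TSDB is empty, every series in that share is brand new to it. Below is a minimal, self-contained sketch of that ownership shift using a simplified token ring (this is an illustration, not the actual Cortex ring code, which also handles replication factor, zone awareness, etc.):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// ring is a toy token ring: each ingester registers a set of random tokens,
// and a series is owned by the ingester holding the first token >= hash(series).
type ring struct {
	tokens []uint32
	owner  map[uint32]int
}

func buildRing(numIngesters, tokensPer int, seed int64) *ring {
	rng := rand.New(rand.NewSource(seed))
	r := &ring{owner: map[uint32]int{}}
	for ing := 0; ing < numIngesters; ing++ {
		for t := 0; t < tokensPer; t++ {
			tok := rng.Uint32()
			r.tokens = append(r.tokens, tok)
			r.owner[tok] = ing
		}
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

func (r *ring) get(h uint32) int {
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.tokens[i]]
}

func hashSeries(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	// 51st ingester (index 50) joins a ring of 50: its tokens carve out
	// roughly 1/51 of the keyspace, and it takes ownership of every series
	// hashing into those ranges the moment it turns ACTIVE.
	after := buildRing(51, 128, 1)

	const totalSeries = 5_000_000 // scaled down; the real target was ~1.5M active series per ingester
	landedOnNew := 0
	for i := 0; i < totalSeries; i++ {
		if after.get(hashSeries(fmt.Sprintf("series-%d", i))) == 50 {
			landedOnNew++
		}
	}
	fmt.Printf("series instantly owned by the new ingester: %d (%.1f%%)\n",
		landedOnNew, 100*float64(landedOnNew)/totalSeries)
}
```

All of those series have to be created in the new ingester's Head within the distributors' retry window, which is what produces the creation storm described above.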
Today, while scaling out by a large number of ingesters (50), a few of them experienced very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory to grow until some of these ingesters were OOMKilled.
I've been able to profile the affected ingesters, and the following is what I've found so far:
1. Number of in-flight push requests skyrockets right after ingester startup
2. The number of TSDB appenders skyrockets too
3. Average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too
4. Lock contention in Head.getOrCreateWithID()
Not surprisingly, looking at the number of active goroutines, 99.9% were blocked in Head.getOrCreateWithID() due to lock contention.
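For illustration, here is a minimal sketch of the contention pattern (not the actual Prometheus TSDB code, which shards series across lock stripes): the read-locked fast path only helps once a series already exists, so on a cold, empty Head every push goroutine funnels into the exclusive series-creation path.

```go
package main

import (
	"fmt"
	"sync"
)

// memSeries stands in for the per-series state the real Head keeps.
type memSeries struct{ ref uint64 }

// head is a toy stand-in for the TSDB Head: a series map guarded by one RWMutex.
type head struct {
	mtx     sync.RWMutex
	series  map[string]*memSeries
	nextRef uint64
}

// getOrCreate mimics the get-or-create pattern: a cheap read-locked lookup for
// series that already exist, and an exclusive write-locked section to create
// the ones that don't. On a freshly started ingester every series is missing,
// so all push goroutines serialize on the write lock.
func (h *head) getOrCreate(labels string) *memSeries {
	h.mtx.RLock()
	s := h.series[labels]
	h.mtx.RUnlock()
	if s != nil {
		return s // warm path: concurrent readers, no contention
	}

	h.mtx.Lock()
	defer h.mtx.Unlock()
	if s := h.series[labels]; s != nil {
		return s // created by a concurrent request while we waited for the lock
	}
	h.nextRef++
	s = &memSeries{ref: h.nextRef}
	h.series[labels] = s
	return s
}

func main() {
	h := &head{series: map[string]*memSeries{}}

	// Simulate many concurrent push requests, each appending samples for
	// series the ingester has never seen before (the cold-start case).
	var wg sync.WaitGroup
	for req := 0; req < 200; req++ {
		wg.Add(1)
		go func(req int) {
			defer wg.Done()
			for i := 0; i < 5000; i++ {
				h.getOrCreate(fmt.Sprintf("req%d_series%d", req, i))
			}
		}(req)
	}
	wg.Wait()
	fmt.Println("series created:", len(h.series))
}
```

While the creation path is serialized, incoming pushes keep queuing up, which matches the in-flight request and memory growth observed above.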
To Reproduce
I haven't found a way to easily reproduce it locally or with a stress test yet, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).
Storage Engine
- [x] Blocks
- [ ] Chunks
We now have -ingester.instance-limits.max-inflight-push-requests, which allows the in-flight requests to be capped and avoids the OOM; however, it will still create a lot of noise from error messages and retried requests.
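To make that trade-off concrete, here is a minimal sketch of how such an in-flight cap behaves, assuming a simple atomic counter (an illustration of the idea, not Cortex's actual implementation of the flag): requests over the limit fail fast and get retried rather than accumulating in memory.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyInflight = errors.New("too many in-flight push requests")

// pushLimiter is a rough stand-in for an instance-level in-flight cap: count
// the requests currently in flight and reject the ones over the limit instead
// of letting them pile up in memory.
type pushLimiter struct {
	inflight atomic.Int64
	max      int64 // e.g. the value passed to -ingester.instance-limits.max-inflight-push-requests
}

func (l *pushLimiter) push(appendSamples func() error) error {
	n := l.inflight.Add(1)
	defer l.inflight.Add(-1)
	if l.max > 0 && n > l.max {
		// Rejected immediately: the caller logs an error and retries (the
		// "noise" mentioned above), but the ingester's memory stays bounded.
		return errTooManyInflight
	}
	return appendSamples()
}

func main() {
	l := &pushLimiter{max: 2}
	l.inflight.Store(2) // pretend two pushes are already being appended
	fmt.Println(l.push(func() error { return nil }))
}
```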