
Ingesters latency and in-flight requests spike right after startup with empty TSDB #3349


Open
1 of 2 tasks
pracucci opened this issue Oct 14, 2020 · 2 comments

@pracucci (Contributor)

Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered to the ring and its state switches to ACTIVE, it suddenly receives a large number of new series. If you target each ingester to hold about 1.5M active series, it will have to add 1.5M series to TSDB in a matter of a few seconds.
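
To illustrate why this happens, here is a minimal, self-contained sketch of token-based hash ring ownership. This is not the actual Cortex ring implementation (token counts, replication factor and zone awareness are all omitted, and the token/series values below are made up); it only shows that the instant a new ingester's tokens are ACTIVE in the ring, a proportional slice of the series keyspace maps to it, and every series in that slice is brand new to its empty TSDB:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ingesterToken struct {
	owner string
	token uint32
}

// ring is a single-replica token ring: a series hash is owned by the ingester
// holding the first token >= hash (wrapping around to the first token).
type ring struct {
	tokens []ingesterToken
}

func (r *ring) add(owner string, tokens ...uint32) {
	for _, t := range tokens {
		r.tokens = append(r.tokens, ingesterToken{owner, t})
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i].token < r.tokens[j].token })
}

func (r *ring) owner(series string) string {
	h := fnv.New32a()
	h.Write([]byte(series))
	hash := h.Sum32()
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i].token >= hash })
	if i == len(r.tokens) {
		i = 0 // wrap around
	}
	return r.tokens[i].owner
}

func main() {
	r := &ring{}
	r.add("ingester-1", 1<<30, 3<<30)
	r.add("ingester-2", 2<<30, 4<<30-1)

	series := make([]string, 0, 8)
	for i := 0; i < 8; i++ {
		series = append(series, fmt.Sprintf(`up{instance="host-%d"}`, i))
	}

	fmt.Println("before scale-out:")
	for _, s := range series {
		fmt.Println(" ", s, "->", r.owner(s))
	}

	// As soon as the new ingester registers its tokens and turns ACTIVE, the
	// lookup below reflects the updated ring: any series whose hash falls in
	// ingester-3's token ranges is now routed to it, and each one has to be
	// created from scratch in its empty TSDB.
	r.add("ingester-3", 1<<29, 5<<29)

	fmt.Println("after ingester-3 joins:")
	for _, s := range series {
		fmt.Println(" ", s, "->", r.owner(s))
	}
}
```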

Today, while scaling out a large number of ingesters (50), a few of them showed very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory usage to grow until some of these ingesters were OOMKilled.

I've been able to profile the affected ingesters and the following is what I found so far.

1. Number of in-flight push requests skyrockets right after ingester startup

[Screenshot 2020-10-14 at 17:04:02]

2. The number of TSDB appenders skyrockets too

[Screenshot 2020-10-14 at 17:02:59]

3. Average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too

[Screenshot 2020-10-14 at 17:06:17]

4. Lock contention in Head.getOrCreateWithID()

Unsurprisingly, looking at the active goroutines, 99.9% were blocked in Head.getOrCreateWithID() due to lock contention.

[Screenshot 2020-10-14 at 12:55:34]
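
The snippet below is a simplified sketch of the double-checked locking shape of getOrCreateWithID()-style code; it is not the actual Prometheus TSDB implementation (which shards series across lock stripes), but it shows why an empty TSDB is the worst case: when virtually every incoming series is new, every push request falls past the read-locked fast path into the single write-locked create path, and thousands of concurrent requests serialize there:

```go
package main

import (
	"fmt"
	"sync"
)

type memSeries struct {
	id     uint64
	labels string
}

type head struct {
	mtx    sync.RWMutex
	series map[string]*memSeries
	nextID uint64
}

// getOrCreate: read-locked fast path for known series, write-locked slow path
// that creates missing ones.
func (h *head) getOrCreate(labels string) *memSeries {
	// Fast path: series already in the head, only a read lock is needed.
	h.mtx.RLock()
	s := h.series[labels]
	h.mtx.RUnlock()
	if s != nil {
		return s
	}

	// Slow path: take the exclusive lock to create the series. On a freshly
	// started ingester with an empty TSDB virtually every series takes this
	// path, so concurrent push requests queue up on this one lock.
	h.mtx.Lock()
	defer h.mtx.Unlock()
	if s := h.series[labels]; s != nil { // re-check after acquiring the lock
		return s
	}
	h.nextID++
	s = &memSeries{id: h.nextID, labels: labels}
	h.series[labels] = s
	return s
}

func main() {
	h := &head{series: map[string]*memSeries{}}

	// Simulate concurrent push requests, each creating mostly-new series.
	var wg sync.WaitGroup
	for req := 0; req < 8; req++ {
		wg.Add(1)
		go func(req int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				h.getOrCreate(fmt.Sprintf(`metric{req="%d",series="%d"}`, req, i))
			}
		}(req)
	}
	wg.Wait()
	fmt.Println("distinct series created:", len(h.series)) // 8000
}
```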

To Reproduce
I haven't found a way to easily reproduce it locally or with a stress test yet, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).

Storage Engine

  • [x] Blocks
  • [ ] Chunks
@pracucci added the type/performance and storage/blocks (Blocks storage engine) labels on Oct 14, 2020
@bboreham (Contributor) commented on Apr 8, 2021

This reminds me of #3097.

@bboreham (Contributor)

We now have -ingester.instance-limits.max-inflight-push-requests, which allows the in-flight requests to be capped and the OOM to be avoided; however, it will still create a lot of noise from error messages and retried requests.
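
For reference, a hedged example of setting that limit via the CLI flag named above (the value is purely a placeholder, not a recommendation; it has to be tuned per deployment based on expected load and available memory):

```
# Example only: the flag name comes from the comment above; 30000 is a
# hypothetical value to be tuned per deployment.
-ingester.instance-limits.max-inflight-push-requests=30000
```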
