Ingest samples older than 1h for block store #2819


Closed
wants to merge 20 commits

Conversation

codesome
Contributor

@codesome codesome commented Jul 1, 2020

What this PR does:

This PR is currently a super early draft for ingesting samples older than 1h. Lots of TODOs.

The failing out-of-bounds test shows that we are currently appending out-of-bounds samples.

Which issue(s) this PR fixes:
Fixes #2366

Checklist

  • Ingest samples older than 1h
  • Reload TSDBs on restart
  • Query samples from older TSDBs
  • Support query stream over old TSDBs
  • Automatically cleanup old TSDBs
  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

TODO in future PRs

  • Metrics from new TSDBs
  • Limits on the additional TSDB
  • Automatically cleanup stale TSDBs

@bboreham
Contributor

bboreham commented Jul 1, 2020

Should probably clarify in the title that this is for the block store - the chunks storage has ingested old samples forever.

@codesome codesome changed the title Ingest samples older than 1h Ingest samples older than 1h for block store Jul 1, 2020
@codesome codesome marked this pull request as draft July 1, 2020 10:27
Contributor

@pracucci pracucci left a comment


Thanks @codesome for working on this! I did a very quick first pass and left some design feedback. My suggestion is: try to keep it as simple and clean as possible. It's fine if you need to submit preliminary refactorings to existing code to simplify this PR, but let's try to come up with a clean design.

Please also remember:

  • Add a CHANGELOG entry
  • I would allow disabling backfilling by setting a 0 value for the "max age" (I left a dedicated comment)

Limits on the additional TSDB

I would leave this outside of this PR. It's fine adding them separately to keep this PR a bit smaller.

@pull-request-size pull-request-size bot added size/XL and removed size/L labels Jul 6, 2020
@codesome codesome force-pushed the ingest-ooold branch 3 times, most recently from 225a602 to fd1fa7f Compare July 29, 2020 06:17
@codesome
Contributor Author

The code looks more complex than I would like it to be. I am currently writing more unit tests, and once I am satisfied with that, I will spend some time simplifying the code wherever possible before moving on to manual tests.

Contributor

@pracucci pracucci left a comment


Another partial review, sorry. I think this logic is too complicated. I stopped reviewing because I'm wondering whether all this complexity is worth it for the issue we're trying to fix. Let's talk offline.

Things I would like to discuss (don't jump on coding it immediately):

  • The transfer doesn't support backfill TSDBs. I'm fine with that (I believe we shouldn't support it), but it made me wonder whether we could simplify the shutdown procedure and actually always snapshot and ship backfill blocks to the storage at shutdown.

@@ -146,6 +147,8 @@ type TSDBConfig struct {
	StripeSize            int           `yaml:"stripe_size"`
	WALCompressionEnabled bool          `yaml:"wal_compression_enabled"`
	FlushBlocksOnShutdown bool          `yaml:"flush_blocks_on_shutdown"`
	BackfillDir           string        `yaml:"backfill_dir"`
	BackfillLimit         time.Duration `yaml:"backfill_limit"`

The CLI flag is called backfill-max-age while the YAML config option is backfill_limit. We should keep them consistent. Between the two, I believe backfill_max_age is clearer to understand. I would rename the BackfillLimit variable accordingly.
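
For illustration, a minimal sketch of what the consistent naming could look like (the flag prefix, default value and help text below are assumptions, not the PR's actual code):

// In TSDBConfig: rename BackfillLimit -> BackfillMaxAge and the YAML key to match.
BackfillMaxAge time.Duration `yaml:"backfill_max_age"`

// In RegisterFlags: keep the CLI flag aligned with the field/YAML name.
f.DurationVar(&cfg.BackfillMaxAge, "blocks-storage.tsdb.backfill-max-age", 0,
	"Max age of a sample that can still be ingested into the backfill TSDB. 0 disables backfilling.")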

@@ -119,10 +123,42 @@ func (u *userTSDB) setLastUpdate(t time.Time) {
	u.lastUpdate.Store(t.Unix())
}

func (u *userTSDB) getShippedBlocksULID() ([]ulid.ULID, error) {
	b, err := ioutil.ReadFile(filepath.Join(u.Dir(), shipper.MetaFilename))

Two things:

  1. I would use shipper.ReadMetaFile() to simplify this function
  2. If the file does not exist, it should return an empty list of block IDs and no error (it's fine if no block has been shipped yet; see the sketch below)
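
Something along these lines, perhaps (a sketch only; the handling of a missing meta file is an assumption about how shipper.ReadMetaFile reports it, and the usual os/errors/ulid/shipper imports are assumed):

func (u *userTSDB) getShippedBlocksULID() ([]ulid.ULID, error) {
	meta, err := shipper.ReadMetaFile(u.Dir())
	if err != nil {
		// No meta file yet simply means no block has been shipped: not an error.
		if os.IsNotExist(errors.Cause(err)) {
			return nil, nil
		}
		return nil, errors.Wrap(err, "read shipper meta file")
	}
	return meta.Uploaded, nil
}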

@@ -274,6 +311,11 @@ func (i *Ingester) startingV2(ctx context.Context) error {
	return errors.Wrap(err, "opening existing TSDBs")
}

// Scan and open backfill TSDB's that already exist on disk.
if err := i.openExistingBackfillTSDB(context.Background()); err != nil {

You should pass ctx, not context.Background().

}

func (i *Ingester) openExistingBackfillTSDB(ctx context.Context) error {
	level.Info(util.Logger).Log("msg", "opening existing TSDBs")

Suggested change
level.Info(util.Logger).Log("msg", "opening existing TSDBs")
level.Info(util.Logger).Log("msg", "opening existing backfill TSDBs")

}

userID := u.Name()
userPath := filepath.Join(i.cfg.BlocksStorageConfig.TSDB.BackfillDir, userID)

I guess you've added the function BackfillBlocksDir() on the config specifically for this 😉

buckets.Unlock()
}
i.TSDBState.backfillDBs.tsdbsMtx.Unlock()


At this point, we should log the success/error similarly to what we do in openExistingTSDB()

Comment on lines +1126 to +1131
wg.Add(1)
go func() {
	defer wg.Done()
	i.closeAllBackfillTSDBs()
}()


I think we should do it after i.userStatesMtx.Unlock(). Also, to further simplify the code, I would run wg.Wait() to wait until all user TSDBs are closed, and then call closeAllBackfillTSDBs() outside of any goroutine.
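
Roughly this shape, where the surrounding code is assumed rather than copied from the PR:

// ... per-user TSDBs are closed concurrently above, each goroutine calling wg.Done() ...

i.userStatesMtx.Unlock()

// Wait until every per-user TSDB has been closed.
wg.Wait()

// Backfill TSDBs are expected to be few, so close them serially,
// outside of any goroutine.
i.closeAllBackfillTSDBs()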

})
}

func (i *Ingester) runConcurrentBackfillWorkers(ctx context.Context, concurrency int, userFunc func(*userTSDB)) {

I think runConcurrentBackfillWorkers() and runConcurrentUserWorkers() could be generalised into a single function like this:

func runConcurrentUserWorkers(ctx context.Context, userIDs []string, concurrency int, userFunc func(userID string))

Then:

  • You need two functions to generate the list of user IDs. We already have i.getTSDBUsers() then you could do the same for the backfill TSDBs.
  • The callback passed to runConcurrentBackfillWorkers() currently receives a *userTSDB, but it looks easy to pass the userID to the callback instead and have the callback fetch the backfill TSDB itself. I understand the actual parallelisation is not exactly the same, but we should strive for simplicity first (without compromising correctness) and then we can always optimise it. As already mentioned a bunch of times, I expect backfilling to be an uncommon use case (I don't want to keep multiple TSDBs open for every tenant at all times).

This is a refactoring that could be done in a preliminary PR.
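
For reference, a sketch of what such a generalised helper could look like (the body is an assumption based on the usual worker-pool pattern, with the standard context and sync imports; it is not code from the PR):

func runConcurrentUserWorkers(ctx context.Context, userIDs []string, concurrency int, userFunc func(userID string)) {
	wg := sync.WaitGroup{}
	ch := make(chan string)

	// Start the workers; each one processes user IDs until the channel is closed.
	for ix := 0; ix < concurrency; ix++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for userID := range ch {
				userFunc(userID)
			}
		}()
	}

	// Feed the workers, stopping early if the context is cancelled.
sendLoop:
	for _, userID := range userIDs {
		select {
		case ch <- userID:
		case <-ctx.Done():
			break sendLoop
		}
	}

	close(ch)
	wg.Wait()
}

The caller would then pass i.getTSDBUsers() (or the backfill equivalent) together with a callback that looks up the right TSDB by user ID.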

cause := errors.Cause(err)
if cause == storage.ErrOutOfBounds &&
	i.cfg.BlocksStorageConfig.TSDB.BackfillLimit > 0 &&
	s.TimestampMs > db.Head().MaxTime()-i.cfg.BlocksStorageConfig.TSDB.BackfillLimit.Milliseconds() {

  1. Could there be a case where db.Head().MaxTime() is math.MinInt64 (e.g. an empty head)?
  2. Should this be based on the head max time or on time.Now()? The "max age" could also be seen as relative to "now", and all other time-based limits we have are based on "now" (a sketch of that variant follows below).
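
For the sake of discussion, a minimal sketch of the "now"-based variant (BackfillMaxAge is the hypothetical renamed config field; the surrounding variables follow the snippet above, and this is an alternative, not the PR's code):

maxAge := i.cfg.BlocksStorageConfig.TSDB.BackfillMaxAge
nowMs := time.Now().UnixNano() / int64(time.Millisecond)

if errors.Cause(err) == storage.ErrOutOfBounds &&
	maxAge > 0 &&
	s.TimestampMs > nowMs-maxAge.Milliseconds() {
	// The sample is out of bounds for the main head but still within the
	// backfill window relative to "now": route it to the backfill TSDB.
	// A zero max age disables backfilling entirely.
}

A "now"-based cutoff also sidesteps the empty-head case, since it does not depend on db.Head().MaxTime() at all.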

@@ -495,8 +552,13 @@ func (i *Ingester) v2Push(ctx context.Context, req *client.WriteRequest) (*clien
}

// The error looks an issue on our side, so we should rollback
if rollbackErr := app.Rollback(); rollbackErr != nil {
level.Warn(util.Logger).Log("msg", "failed to rollback on error", "user", userID, "err", rollbackErr)
var merr tsdb_errors.MultiError

Renaming merr back to rollbackErr, as it was previously, would help clarify what this error is about.
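
For illustration, a rough sketch of how the rollback block could read with the clearer name (backfillApp is a hypothetical stand-in for whatever else the PR rolls back alongside the head appender; MultiError.Add ignores nil errors):

var rollbackErr tsdb_errors.MultiError
rollbackErr.Add(app.Rollback())         // head appender
rollbackErr.Add(backfillApp.Rollback()) // hypothetical backfill appender
if err := rollbackErr.Err(); err != nil {
	level.Warn(util.Logger).Log("msg", "failed to rollback on error", "user", userID, "err", err)
}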

@codesome
Contributor Author

Superseded by #3025

@codesome codesome closed this Aug 25, 2020

Successfully merging this pull request may close these issues.

Blocks storage unable to ingest samples older than 1h after an outage
3 participants