
Design Proposal: Use multiple buckets for s3 chunk storage for manual cleanup #1594

Closed
@thorfour

Description


Abstract

Add a feature that uses multiple buckets to store spans of data over a given time frame.
Modify table manager to support issuing s3.DeleteBucket requests during cleanup when a given bucket has aged out.

Reason(s)

Presently, S3 chunk cleanup is expected to be performed by the S3 storage provider's retention policy. Some S3 implementations (e.g., DigitalOcean Spaces) do not provide retention policies. This design keeps Cortex data more self-contained, not requiring external dependencies to manage the life-cycle of data. It is meant to address issue #1591.

Goals

  • Self-contained data life-cycle for Cortex chunks.
  • Efficient delete processing (fewest number of API requests).
  • Support for custom retention periods.

Implementation

A new set of flags would be added:
  • --s3.retain-by-bucket-period sets the time duration each bucket covers.
  • --s3.bucket-prefix sets the prefix of the buckets that will be created.

The ingester will upload each chunk to the S3 bucket whose time span covers the chunk's Through timestamp. A chunk that spans a bucket boundary is therefore placed into the newer bucket, so it is still retained for the full retention period.

The querier will select the bucket to retrieve a chunk from based on the bucket that the chunk's Through timestamp places it in.

The table manager will maintain its periodic calls to DeleteChunksBefore, which, for S3 bucket clients, will issue s3.DeleteBucket calls for buckets that exceed the retention policy set in the table manager.

The complex part of this proposal is the creation life-cycle of the buckets. Since there are multiple ingesters, one of them would have to be elected "leader" to create buckets.

Instead, I believe the table manager should also be responsible for creating the buckets, maintaining a rolling window of buckets so that there is always an available bucket to write to. Note that with many S3 providers bucket names are globally unique, so on creation of a new bucket it would likely be best to use the format {{bucket_prefix}}-{{uuid}}-{{through_timestamp}}
