Description
Describe the bug
After Cortex finishes compacting a block, it marks the block for deletion by adding a deletion marker at the storage level.
Currently, Cortex uploads two markers when marking a block for deletion: the first at the block level in storage, and the second in the global markers folder for the bucket.
cortex/pkg/storage/tsdb/bucketindex/markers_bucket_client.go
Lines 41 to 46 in dd4240d
The code uploads to the block level first and then to the global markers folder.
From what I could infer from the blocks cleaner, the cleaner, which deletes all blocks marked for deletion, uses the global markers folder to decide which blocks to delete. It does not check the block's own deletion marker.
Cleaner
https://github.com/cortexproject/cortex/blob/master/pkg/compactor/blocks_cleaner.go#L340
Index update
https://github.com/cortexproject/cortex/blob/master/pkg/storage/tsdb/bucketindex/updater.go#L172
If markers_bucket_client uploads the marker to the block but then fails to upload it to the global markers folder, the block will not be deleted.
There is also another issue: when a block is marked for deletion again, Thanos checks whether the marker already exists and, if it does, does not re-upload it.
https://github.com/thanos-io/thanos/blob/d1405e4a2ec2e7bb47cd46ce5648b5809fa77579/pkg/block/block.go#L178-L187
This leaves no chance to re-upload the global marker.
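To make the failure mode concrete, here is a minimal in-memory sketch of the flow as I understand it. It is not the Cortex or Thanos code: `memBucket`, `markForDeletion`, `visibleToCleaner`, and the key layout are hypothetical names chosen for illustration, and the "already marked" check is a simplified stand-in for the Thanos check linked above.

```go
package main

import (
	"errors"
	"fmt"
)

// memBucket is a hypothetical in-memory stand-in for the object store.
type memBucket struct {
	objects map[string]bool
	failOn  string // key whose upload fails, simulating a storage error
}

func newMemBucket() *memBucket {
	return &memBucket{objects: map[string]bool{}}
}

func (b *memBucket) upload(key string) error {
	if key == b.failOn {
		return errors.New("simulated storage failure: " + key)
	}
	b.objects[key] = true
	return nil
}

// Illustrative key layout (assumption); the real paths may differ.
func blockMarkerKey(blockID string) string  { return blockID + "/deletion-mark.json" }
func globalMarkerKey(blockID string) string { return "markers/" + blockID + "-deletion-mark.json" }

// markForDeletion models the described behavior: skip if the block-level
// marker exists, otherwise upload the block-level marker, then the global one.
func markForDeletion(b *memBucket, blockID string) error {
	if b.objects[blockMarkerKey(blockID)] {
		return nil // already marked: nothing is re-uploaded
	}
	if err := b.upload(blockMarkerKey(blockID)); err != nil {
		return err
	}
	return b.upload(globalMarkerKey(blockID))
}

// visibleToCleaner models a cleaner that only consults the global marker.
func visibleToCleaner(b *memBucket, blockID string) bool {
	return b.objects[globalMarkerKey(blockID)]
}

func main() {
	b := newMemBucket()
	b.failOn = globalMarkerKey("block-1") // second upload fails
	fmt.Println("first attempt:", markForDeletion(b, "block-1"))
	b.failOn = "" // storage recovers
	fmt.Println("retry:", markForDeletion(b, "block-1"))
	fmt.Println("visible to cleaner:", visibleToCleaner(b, "block-1"))
}
```

In this model the retry returns no error, yet the global marker is never written, so the block stays invisible to the cleaner.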
Is there a retry that I am missing that solves this issue?
Should we reverse the order in which the markers are uploaded?
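As a rough sketch of what the opposite order could look like, the toy model below uploads the global marker first and the block-level marker second. All names here (`store`, `markGlobalFirst`, the key layout) are hypothetical, and this only illustrates the idea under the assumption that the "already marked" check keeps looking at the block-level marker; it is not a proposed patch.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory object store with one injectable failure.
type store struct {
	objects map[string]bool
	failOn  string
}

func (s *store) put(key string) error {
	if key == s.failOn {
		return errors.New("simulated storage failure: " + key)
	}
	s.objects[key] = true
	return nil
}

// markGlobalFirst writes the global marker before the block-level marker.
// The existence check still uses the block-level marker (assumption).
func markGlobalFirst(s *store, blockID string) error {
	blockKey := blockID + "/deletion-mark.json"
	globalKey := "markers/" + blockID + "-deletion-mark.json"
	if s.objects[blockKey] {
		return nil // already marked
	}
	if err := s.put(globalKey); err != nil {
		return err
	}
	return s.put(blockKey)
}

func main() {
	globalKey := "markers/block-1-deletion-mark.json"
	s := &store{objects: map[string]bool{}, failOn: globalKey}
	fmt.Println("first attempt:", markGlobalFirst(s, "block-1"))
	s.failOn = "" // storage recovers
	fmt.Println("retry:", markGlobalFirst(s, "block-1"))
	fmt.Println("global marker present:", s.objects[globalKey])
}
```

With this order, a failure on the global marker leaves nothing behind, so a retry is not short-circuited. A failure on the block-level marker leaves the global marker in place, which is the one the cleaner appears to read.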
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (SHA or version)
- Perform Operations(Read/Write/Others)
This is hard to reproduce, as we need to simulate a failure in the storage layer at the right moment. One possibility is to run Cortex without the line that uploads the global marker, just as a test.
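Another way to simulate the failure might be a fault-injecting wrapper around the bucket client that rejects uploads to the global markers folder. The sketch below shows the idea with a hypothetical minimal `uploader` interface; it is not the real bucket interface, and `failingUploader`, `memUploader`, and the `markers/` prefix are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// uploader is a hypothetical minimal interface, not the real bucket client.
type uploader interface {
	Upload(key string, data []byte) error
}

// memUploader stores objects in memory.
type memUploader struct {
	objects map[string][]byte
}

func (m *memUploader) Upload(key string, data []byte) error {
	m.objects[key] = data
	return nil
}

// failingUploader rejects uploads whose key starts with failPrefix
// and forwards everything else to the wrapped uploader.
type failingUploader struct {
	next       uploader
	failPrefix string
}

func (f *failingUploader) Upload(key string, data []byte) error {
	if strings.HasPrefix(key, f.failPrefix) {
		return errors.New("injected failure for " + key)
	}
	return f.next.Upload(key, data)
}

func main() {
	mem := &memUploader{objects: map[string][]byte{}}
	var u uploader = &failingUploader{next: mem, failPrefix: "markers/"}

	// The block-level marker goes through; the global marker is rejected.
	fmt.Println(u.Upload("block-1/deletion-mark.json", []byte("{}")))
	fmt.Println(u.Upload("markers/block-1-deletion-mark.json", []byte("{}")))
	fmt.Println("objects stored:", len(mem.objects))
}
```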
Expected behavior
The block should be deleted.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm