Added all_files system table to the Iceberg connector #13424
Closed
Description
Adds $all_files system table support. Related: the $data_files and $all_data_files metadata tables. Trino already supports $files. The $all_files table includes the data files from all available snapshots of a table, whereas the $files metadata table includes only those of the current snapshot.
Use cases
This is useful when a user is debugging a data issue. For example, when Trino or Spark is used for metadata or data optimization, it can modify both the metadata and the data.
Another case is when a snapshot has been rolled back or forward from Trino or Spark, and the user wants to understand which data files are present and whether the optimization or rollback operations have any implications.
$all_files and $all_manifests can also be used to add optimization features to Trino, such as purging orphan files, or identifying the partitions modified since the last run in order to implement a moving-window data compaction feature.
Where we need to identify the ...
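As a rough illustration of the orphan-file use case, the comparison could look like the sketch below. It assumes the file paths referenced by any snapshot have already been read from "table$all_files" and that the storage listing comes from elsewhere; `OrphanFileSketch` and `orphanCandidates` are hypothetical names, not part of Trino or Iceberg. The result is only a list of candidates: a real purge would also have to account for manifests, metadata files, and in-progress writes.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Trino or Iceberg.
final class OrphanFileSketch {
    private OrphanFileSketch() {}

    // storagePaths:    paths found by listing the table's data directory (assumed input)
    // referencedPaths: paths returned by a query on "table$all_files" (assumed input)
    // Returns the paths present in storage that no snapshot references, sorted.
    static Set<String> orphanCandidates(Set<String> storagePaths, Set<String> referencedPaths) {
        Set<String> orphans = new TreeSet<>(storagePaths);
        orphans.removeAll(referencedPaths);
        return orphans;
    }

    public static void main(String[] args) {
        Set<String> storage = Set.of("data/a.parquet", "data/b.parquet", "data/c.parquet");
        Set<String> referenced = Set.of("data/a.parquet", "data/c.parquet");
        // prints [data/b.parquet]
        System.out.println(orphanCandidates(storage, referenced));
    }
}
```

Using $files instead of $all_files as the reference set would flag files that older snapshots still need, which is why the all-snapshots view matters here.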
Design
Adds a new class, AllFilesTable, and uses it as the parent of FilesTable, since both implement similar responsibilities.
Testing
Added a test case to the existing TestIcebergSystemTable.
I considered using rollback to showcase that $all_files returns all the data files, but that updates the table history, and testHistoryTable would then depend on the order and the operations performed in the testAllFilesTable test. So I have not used it. I can add it if we think it is worthwhile.
Syntax:
select * from "table$all_files"
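To picture the difference between the two tables and the parent/child relationship described under Design, here is a simplified model. It is a sketch under stated assumptions, not the code in this PR: the `*Sketch` classes are hypothetical, a snapshot is reduced to a list of file paths, the last snapshot in the list is assumed to be the current one, and duplicate paths are collapsed for readability.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for AllFilesTable: scans every snapshot.
class AllFilesTableSketch {
    // Each inner list holds the data file paths of one snapshot, oldest first (assumption).
    protected final List<List<String>> snapshots;

    AllFilesTableSketch(List<List<String>> snapshots) {
        this.snapshots = List.copyOf(snapshots);
    }

    // The only behavior the child overrides: which snapshots to scan.
    protected List<List<String>> snapshotsToScan() {
        return snapshots;
    }

    // Shared logic: collect file paths from the selected snapshots,
    // keeping first-seen order and dropping duplicates (a simplification).
    final List<String> filePaths() {
        Set<String> paths = new LinkedHashSet<>();
        for (List<String> snapshotFiles : snapshotsToScan()) {
            paths.addAll(snapshotFiles);
        }
        return new ArrayList<>(paths);
    }

    public static void main(String[] args) {
        // Snapshot 2 models a compaction that rewrote a and b into c.
        List<List<String>> snapshots = List.of(
                List.of("a.parquet", "b.parquet"),
                List.of("c.parquet"));
        // prints [a.parquet, b.parquet, c.parquet]
        System.out.println(new AllFilesTableSketch(snapshots).filePaths());
        // prints [c.parquet]
        System.out.println(new FilesTableSketch(snapshots).filePaths());
    }
}

// Hypothetical stand-in for FilesTable: same logic, current snapshot only.
class FilesTableSketch extends AllFilesTableSketch {
    FilesTableSketch(List<List<String>> snapshots) {
        super(snapshots);
    }

    @Override
    protected List<List<String>> snapshotsToScan() {
        return snapshots.isEmpty()
                ? List.of()
                : List.of(snapshots.get(snapshots.size() - 1));
    }
}
```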
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: