Added all_files system table to the Iceberg connector #13424

osscm · 2022-07-29T23:36:57Z

Description

This PR adds support for the $all_files system table.
Spark already supports Iceberg's $data_files and $all_data_files metadata tables, and Trino already supports $files.
The $all_files table includes the data files from all available snapshots of a table, whereas the $files metadata table includes only those of the current snapshot.
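
To make the difference concrete, here is a minimal Java sketch of the two views. It is a toy model, not Trino or Iceberg code: `Snapshot`, `currentFiles`, and `allFiles` are hypothetical names, and deduplicating by file path is an assumption of this sketch rather than a statement about what the real table returns.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class AllFilesSketch {
    // Hypothetical stand-in for a table snapshot: an id plus the data file paths it references.
    static final class Snapshot {
        final long id;
        final List<String> dataFiles;

        Snapshot(long id, List<String> dataFiles) {
            this.id = id;
            this.dataFiles = dataFiles;
        }
    }

    // Analogue of "$files": only the data files of the current snapshot.
    static List<String> currentFiles(List<Snapshot> history, long currentSnapshotId) {
        for (Snapshot snapshot : history) {
            if (snapshot.id == currentSnapshotId) {
                return new ArrayList<>(snapshot.dataFiles);
            }
        }
        return new ArrayList<>();
    }

    // Analogue of "$all_files": data files from every available snapshot.
    // Deduplication by path is an assumption made for this sketch.
    static List<String> allFiles(List<Snapshot> history) {
        Set<String> seen = new LinkedHashSet<>();
        for (Snapshot snapshot : history) {
            seen.addAll(snapshot.dataFiles);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<Snapshot> history = List.of(
                new Snapshot(1L, List.of("a.parquet")),
                new Snapshot(2L, List.of("a.parquet", "b.parquet")));
        // Pretend the table was rolled back to snapshot 1.
        System.out.println(currentFiles(history, 1L));
        System.out.println(allFiles(history));
    }
}
```

In this model, after a rollback to snapshot 1 the current view no longer lists `b.parquet`, while the all-snapshots view still does.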

Use cases

This is useful when a user is debugging a data issue. For example, when Trino or Spark is used for metadata or data optimization, it can modify both the metadata and the data.
Another case is when a snapshot has been rolled back or moved forward from Trino or Spark, and the user wants to understand which data files are present and whether the optimization or rollback operations have any implications.
Together with $all_manifests, this table can also be used to build optimization features in Trino, such as purging orphan files, or identifying the partitions modified since the last run in order to implement a moving-window data compaction feature.
Where we need to identify the

Design
Adds a new class, AllFilesTable, and uses it as the parent of FilesTable, since both have similar responsibilities.
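
One way to picture the shared-parent idea is the sketch below. It is an illustration only: it uses an abstract parent in the spirit of the commit titled "Added AbstractFilesTable", every class and method name is hypothetical, and none of it reflects the actual Trino classes or their signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FilesTableDesignSketch {
    // Shared parent: owns row building; subclasses only choose which snapshots to scan.
    abstract static class AbstractFilesTableSketch {
        protected final Map<Long, List<String>> filesBySnapshot; // snapshot id -> data file paths
        protected final long currentSnapshotId;

        AbstractFilesTableSketch(Map<Long, List<String>> filesBySnapshot, long currentSnapshotId) {
            this.filesBySnapshot = new TreeMap<>(filesBySnapshot); // sorted by snapshot id
            this.currentSnapshotId = currentSnapshotId;
        }

        // The only point where the two tables differ.
        protected abstract List<Long> snapshotsToScan();

        // Common logic: one row per (snapshot, file), in scan order.
        final List<String> rows() {
            List<String> rows = new ArrayList<>();
            for (Long id : snapshotsToScan()) {
                for (String path : filesBySnapshot.getOrDefault(id, List.of())) {
                    rows.add(id + ":" + path);
                }
            }
            return rows;
        }
    }

    // Analogue of "$files": scans only the current snapshot.
    static final class FilesTableSketch extends AbstractFilesTableSketch {
        FilesTableSketch(Map<Long, List<String>> filesBySnapshot, long currentSnapshotId) {
            super(filesBySnapshot, currentSnapshotId);
        }

        @Override
        protected List<Long> snapshotsToScan() {
            return List.of(currentSnapshotId);
        }
    }

    // Analogue of "$all_files": scans every available snapshot.
    static final class AllFilesTableSketch extends AbstractFilesTableSketch {
        AllFilesTableSketch(Map<Long, List<String>> filesBySnapshot, long currentSnapshotId) {
            super(filesBySnapshot, currentSnapshotId);
        }

        @Override
        protected List<Long> snapshotsToScan() {
            return new ArrayList<>(filesBySnapshot.keySet());
        }
    }
}
```

Keeping the row-building code in one place means the two tables can only diverge in which snapshots they read, which is the single behavioral difference the description calls out.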

Testing
Added a test case to the existing TestIcebergSystemTable.
I considered using a rollback to showcase that $all_files returns all data files, but that updates the table history, and testHistoryTable depends on the order of the operations performed in the testAllFilesTable test, so I have not used it. I can add it if we think it is worthwhile.

Is this change a fix, improvement, new feature, refactoring, or other?
Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
NA

How would you describe this change to a non-technical end user or system administrator?
Allows users to access all the data files referenced by the table, including those of rolled-back and old snapshots. This helps when debugging issues such as why a query returns certain data rather than the current data (for example, after snapshots have been rolled back).

syntax:
select * from "table$all_files"

Related issues, pull requests, and links

Fixes #11172

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Add support for the `all_files` metadata table, which Spark also supports. ({issue}`11172`)

osscm and others added 3 commits July 29, 2022 15:51

Added AbstractFilesTable, to support AllFilesTable

1665119

Added all_files Iceberg metadata table

06c9ec8

empty

46effbd

cla-bot bot added the cla-signed label Jul 29, 2022

github-actions bot added the docs label Jul 29, 2022

osscm closed this Jul 29, 2022

osscm deleted the add-all—files branch July 29, 2022 23:43

osscm restored the add-all—files branch July 29, 2022 23:43
