Added all_files system table to the Iceberg connector #13424

Closed
wants to merge 3 commits into from

Conversation

osscm
Contributor

@osscm osscm commented Jul 29, 2022

Description

  • This PR adds support for the $all_files system table.
  • Spark already supports Iceberg's $data_files and $all_data_files metadata tables, and Trino already supports $files.
  • $all_files lists the data files from all available snapshots of a table, whereas the $files metadata table lists only those of the current snapshot.

Use cases

  • Debugging data issues. When Trino or Spark is used for metadata or data optimization, it can modify both the metadata and the data.

  • Understanding the state of a table after a snapshot has been rolled back or forward from Trino or Spark: which data files are present, and whether the optimization or rollback operations have any implications.

  • Together with $all_manifests, this table can also be used to add optimization features to Trino, such as purging orphan files or identifying partitions modified since the last run in order to implement a moving-window data compaction feature. detail
    Where we need to identify the
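
The orphan-file use case above reduces to a set difference: files that exist in storage but are not referenced by any snapshot. The following is a minimal sketch of that idea only, assuming both file lists have already been collected as plain path strings (for example, the referenced set from $all_files). The class and method names are hypothetical and are not part of this PR or of Trino; a real purge would also have to consider manifests, metadata files, and writes that are still in progress.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of this PR or of the Trino code base.
final class OrphanFilesSketch {
    private OrphanFilesSketch() {}

    // Returns the paths present in storage that no snapshot references.
    static Set<String> findOrphans(Set<String> filesInStorage, Set<String> referencedByAnySnapshot) {
        Set<String> orphans = new LinkedHashSet<>(filesInStorage);
        orphans.removeAll(referencedByAnySnapshot);
        return orphans;
    }

    public static void main(String[] args) {
        // Assumed sample data: three files on disk, two of them still referenced.
        Set<String> storage = new LinkedHashSet<>(List.of("a.parquet", "b.parquet", "c.parquet"));
        Set<String> referenced = Set.of("a.parquet", "c.parquet");
        System.out.println(findOrphans(storage, referenced)); // prints [b.parquet]
    }
}
```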

Design
Adds a new class, AllFilesTable, and uses it as the parent of FilesTable, since both have similar responsibilities.
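
The parent/child relationship could look roughly like the sketch below. It only illustrates the inheritance idea: the class names are borrowed from this PR, but the fields, the snapshotsToScan hook, and the map-based table model are assumptions made for the example and do not reflect the actual Trino SystemTable implementation. The sketch also deduplicates paths for simplicity, which the real metadata table may not do.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the design; not the code from this PR.
final class FilesTableDesignSketch {
    private FilesTableDesignSketch() {}

    // Parent: lists data files from every available snapshot.
    static class AllFilesTable {
        protected final Map<Long, List<String>> filesBySnapshot;
        protected final long currentSnapshotId;

        AllFilesTable(Map<Long, List<String>> filesBySnapshot, long currentSnapshotId) {
            this.filesBySnapshot = filesBySnapshot;
            this.currentSnapshotId = currentSnapshotId;
        }

        // Assumed hook: subclasses narrow which snapshots are scanned.
        protected Collection<Long> snapshotsToScan() {
            return filesBySnapshot.keySet();
        }

        // Collects the distinct file paths of the selected snapshots, in encounter order.
        final List<String> rows() {
            Set<String> seen = new LinkedHashSet<>();
            for (Long snapshotId : snapshotsToScan()) {
                seen.addAll(filesBySnapshot.getOrDefault(snapshotId, List.of()));
            }
            return new ArrayList<>(seen);
        }
    }

    // Child: same row-building logic, restricted to the current snapshot.
    static class FilesTable extends AllFilesTable {
        FilesTable(Map<Long, List<String>> filesBySnapshot, long currentSnapshotId) {
            super(filesBySnapshot, currentSnapshotId);
        }

        @Override
        protected Collection<Long> snapshotsToScan() {
            return List.of(currentSnapshotId);
        }
    }

    public static void main(String[] args) {
        // Assumed history: snapshot 3 rewrote the table into a single new file.
        Map<Long, List<String>> history = new LinkedHashMap<>();
        history.put(1L, List.of("a.parquet"));
        history.put(2L, List.of("a.parquet", "b.parquet"));
        history.put(3L, List.of("c.parquet"));

        System.out.println(new AllFilesTable(history, 3L).rows()); // prints [a.parquet, b.parquet, c.parquet]
        System.out.println(new FilesTable(history, 3L).rows());    // prints [c.parquet]
    }
}
```

Keeping the row-building logic in the parent means the two tables differ only in which snapshots they visit.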

Testing
Added a test case to the existing TestIcebergSystemTable.
I considered using a rollback to showcase that $all_files returns all the data files, but that updates the table history, and testHistoryTable depends on the order of the operations performed in the testAllFilesTable test, so I have not used it. I can add it if we think it is worthwhile.

Is this change a fix, improvement, new feature, refactoring, or other?
Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
NA

How would you describe this change to a non-technical end user or system administrator?
Allows users to access all the data files referenced by the table, including those from rolled-back and old snapshots. This is helpful for debugging issues such as why a query returns certain data rather than the current data (for example, when snapshots have been rolled back).

syntax:
select * from "table$all_files"
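
Because the metadata table name contains `$`, the table name and suffix are double-quoted together as one identifier, as in the query above. A small sketch of building such a query string from Java follows; the helper class and the `orders` table name are hypothetical and only illustrate the quoting.

```java
// Hypothetical helper; not part of this PR or of the Trino client API.
final class MetadataTableQuerySketch {
    private MetadataTableQuerySketch() {}

    // Wraps a name in double quotes, doubling any embedded double quote.
    static String quoteIdentifier(String name) {
        return "\"" + name.replace("\"", "\"\"") + "\"";
    }

    // Builds a SELECT over a metadata table such as all_files.
    static String selectAll(String tableName, String metadataTable) {
        return "SELECT * FROM " + quoteIdentifier(tableName + "$" + metadataTable);
    }

    public static void main(String[] args) {
        System.out.println(selectAll("orders", "all_files")); // prints SELECT * FROM "orders$all_files"
    }
}
```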

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Add support for the `all_files` metadata table, which is also supported by Spark. ({issue}`11172`)

@cla-bot cla-bot bot added the cla-signed label Jul 29, 2022
@github-actions github-actions bot added the docs label Jul 29, 2022
@osscm osscm closed this Jul 29, 2022
@osscm osscm deleted the add-all—files branch July 29, 2022 23:43
@osscm osscm restored the add-all—files branch July 29, 2022 23:43
Development

Successfully merging this pull request may close these issues.

Support Iceberg's all_files Metadata table
2 participants