Add all_data_files, all_manifests, and all_entries metadata tables #805


Merged
rdblue merged 2 commits into apache:master on Feb 17, 2020

Conversation

@rdblue (Contributor) commented Feb 16, 2020

This adds 3 new metadata tables and tests:

  • all_data_files lists all data files in a table that are accessible from any valid (not expired) snapshot
  • all_entries lists all manifest entries in a table that are accessible from any valid snapshot
  • all_manifests lists all manifest files in a table that are accessible from any valid snapshot

These tables may contain duplicate rows. Deduplication can't be done through the current scan interface unless all of the work is done during scan planning on a single node. Duplicates are the trade-off for being able to process the metadata in parallel for large tables.
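Where the duplicates matter, they can be removed at query time instead. A minimal sketch, assuming the file_path column from the data files schema:

SELECT DISTINCT file_path
FROM db.table.all_data_files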

Use cases

We recently added the all_data_files and all_manifests tables to enable building services that manage data files. For example, a janitor service that cleans up orphaned or dangling data files needs to be able to list all valid files in a table. Along with the snapshots table that has manifest list locations, all_manifests and all_data_files enable listing all data and metadata files referenced by a table.
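As a sketch, a janitor service could list every file reachable from valid snapshots with a union over the three tables (assuming the manifest_list column on the snapshots table and the path column on all_manifests):

SELECT manifest_list AS path FROM db.table.snapshots
UNION
SELECT path FROM db.table.all_manifests
UNION
SELECT file_path AS path FROM db.table.all_data_files

The UNION also removes the duplicate rows noted above.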

We use the all_entries table to detect the last modified time of partitions. This requires knowing when a file was appended or overwritten, while ignoring later rewrites. In the query below, e.status = 1 selects entries whose status is ADDED (0 is EXISTING, 2 is DELETED):

SELECT
    max(s.committed_at) as last_updated_at,
    e.data_file.partition.*
FROM db.table.all_entries e
JOIN db.table.snapshots s
  ON e.snapshot_id = s.snapshot_id
WHERE e.status = 1 AND s.operation IN ('append', 'overwrite')
GROUP BY e.data_file.partition

@rdblue requested a review from aokolnychyi on February 16, 2020 23:39
@rdblue force-pushed the add-new-metadata-tables branch from 843e3d8 to b446de9 on February 17, 2020 01:00
@danielcweeks (Contributor) left a comment:

Only a couple of small nits/comments that apply across all the metadata table implementations. It seems like an easy fix to expose the ability to configure split sizes, which may be an issue in some rare cases (though I wouldn't expect manifest lists/files to typically exceed the default 32 MB).

+1

Schema schema = new Schema(DataFile.getType(table.spec().partitionType()).fields());
if (table.spec().fields().size() < 1) {
  // avoid returning an empty struct, which is not always supported. instead, drop the partition field (id 102)
  return TypeUtil.selectNot(schema, Sets.newHashSet(102));
}
Contributor:

This may be a bit out of scope, but shouldn't we have a better way to represent reserved field ids than magic numbers?

rdblue (Author):

We can look up the number in the schema using schema.findField("partition").fieldId(), but that's really just looking up a constant value using another constant, the field name "partition". I thought it was easier to use a comment to explain what's happening than to do the lookup. Maybe there's an alternative I didn't think of. Any ideas?
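For reference, a minimal sketch of the lookup-based alternative; it behaves the same as the hard-coded version, just resolving the id by name:

// resolve the reserved partition field id by name instead of hard-coding 102
int partitionFieldId = schema.findField("partition").fieldId();
return TypeUtil.selectNot(schema, Sets.newHashSet(partitionFieldId));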


@Override
public String location() {
  return table.currentSnapshot().manifestListLocation();
}
Contributor:

Assumes manifest list location?

rdblue (Author):

Yes, but it's okay if this is null.

Contributor:

@rdblue, a quick question: why do we use manifestListLocation for the AllDataFiles, AllEntriesFiles, DataFiles, and ManifestEntries tables, and ops.current().file().location() for the others?

There is one rare case where this might fail: our Reader in Spark uses table.location() to obtain a file system object, and if the manifest list location is null, it will fail.

Contributor:

The check was added recently with locality for HDFS.

rdblue (Author):

For location, I suspect it's just a copy/paste error. I was much more careful setting locations in the read tasks than for the metadata tables. We can clean that up.

Also, I forgot about the HDFS check. I think we should fix the cases where this may be null and return the table location instead. We should also make sure the locality logic doesn't cause a failure; it should only log a warning, since locality is just an optimization.
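A minimal sketch of that null fix, assuming a fallback to the table's root location when there is no current snapshot:

@Override
public String location() {
  // fall back to the table location so callers can still derive a file system
  Snapshot current = table.currentSnapshot();
  return current != null ? current.manifestListLocation() : table.location();
}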


@Override
protected long targetSplitSize(TableOperations ops) {
  return TARGET_SPLIT_SIZE;
}
Contributor:

Seems like something we should map to a table/read property with a default, as opposed to hard-coding it.

rdblue (Author):

I think we can add this later if necessary. I'd like to keep these simple and add features as we go.

Contributor:

I believe it generally makes sense to configure the split size for metadata tables. In our tables, it is not uncommon to see 4-6 MB manifest files. If we have a reasonable cluster and allow split sizes to be 16 MB, metadata queries can be 2 times faster.

I've created #817 so that we can discuss it.
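One possible shape for a property-backed split size, sketched with an illustrative property name (the real name and default are up for discussion in #817):

@Override
protected long targetSplitSize(TableOperations ops) {
  // hypothetical read property; falls back to the current hard-coded default
  String value = ops.current().properties().get("read.split.metadata-target-size");
  return value != null ? Long.parseLong(value) : TARGET_SPLIT_SIZE;
}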

@rdblue (Author) commented Feb 17, 2020

Thanks for the review, @danielcweeks! I'm merging this now that tests are passing after resolving the conflict.

@rdblue merged commit 65095c1 into apache:master on Feb 17, 2020

static CloseableIterable<ManifestFile> allManifestFiles(List<Snapshot> snapshots) {
Contributor:

There are a couple of places where we define static methods in one metadata table class and call them from others. It seems we could move some of those into the parent BaseMetadataTable.
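A sketch of that refactor, hoisting the helper into the shared parent (the body assumes Snapshot.manifests() returns the snapshot's manifest files, as in the current API):

abstract class BaseMetadataTable implements Table {
  // shared by the all_* table scans: concatenate each valid snapshot's manifests
  static CloseableIterable<ManifestFile> allManifestFiles(List<Snapshot> snapshots) {
    return CloseableIterable.concat(Iterables.transform(
        snapshots, snapshot -> CloseableIterable.withNoopClose(snapshot.manifests())));
  }
}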
