-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Add all_data_files, all_manifests, and all_entries metadata tables #805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
843e3d8
to
b446de9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a couple small nits/comments that apply across all the metadata table implementations. Seems like an easy fix to expose the ability to split sizes which may be an issue for some rare cases (though I wouldn't expect manifest lists/files would typically exceed the default 32MB).
+1
Schema schema = new Schema(DataFile.getType(table.spec().partitionType()).fields()); | ||
if (table.spec().fields().size() < 1) { | ||
// avoid returning an empty struct, which is not always supported. instead, drop the partition field (id 102) | ||
return TypeUtil.selectNot(schema, Sets.newHashSet(102)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a bit out of scope, but shouldn't we have a better way to represent reserved field ids than magic numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can look up the number in the schema by using schema.findColumn("partition").fieldId()
, but that's really just looking up a constant value using another constant field name, "partition"
. I thought it was easier to use a comment to explain what's happening than to do the lookup. Maybe there's an alternative to this I didn't think of? Any idea?
|
||
@Override | ||
public String location() { | ||
return table.currentSnapshot().manifestListLocation(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assumes manifest list location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it's okay if this is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue, a quick question. Why do we use manifestListLocation
for AllDataFiles
, AllEntriesFiles
, DataFiles
, ManifestEntries
tables and ops.current().file().location()
for others?
There is one rare case when this might fail. Our Reader
in Spark will use table.location()
to obtain a file system object. If the manifest list is null, it will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check was added recently with locality for HDFS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For location, I suspect that it's just a copy/paste error. I was much more careful setting locations in the read tasks than for the metadata tables. We can clean that up.
Also, I forgot about the HDFS check. I think we should probably fix the cases where this may be null and return the table location. And we should also make sure that the locality logic doesn't cause a failure. That should be a warning for the optimization only.
|
||
@Override | ||
protected long targetSplitSize(TableOperations ops) { | ||
return TARGET_SPLIT_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like something we should map to a table/read property \w default as opposed to hard coded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add this later if necessary. I'd like to keep these simple and add features as we go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it generally makes sense to configure the split size for metadata tables. In our tables, it is not uncommon to see 4-6 MB manifest files. If we have a reasonable cluster and allow split sizes to be 16 MB, metadata queries can be 2 times faster.
I've created #817 so that we can discuss it.
Thanks for the review, @danielcweeks! I'm merging this now that tests are passing after resolving the conflict. |
} | ||
} | ||
|
||
static CloseableIterable<ManifestFile> allManifestFiles(List<Snapshot> snapshots) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of places when we define static methods in one metadata table class and call them in others. It seems we can put some of those in the parent BaseMetadataTable
.
This adds 3 new metadata tables and tests:
all_data_files
lists all data files in a table that are accessible from any valid (not expired) snapshotall_entries
lists all manifest entries in a table that are accessible from any valid snapshotall_manifests
lists all manifest files in a table that are accessible from any valid snapshotThese tables may contain duplicate rows. Deduplication can't be done through the current scan interface unless all of the work is done during scan planning on a single node. Duplicates are the trade-off for being able to process the metadata in parallel for large tables.
Use cases
We recently added the
all_data_files
andall_manifests
tables to enable building services that manage data files. For example, a janitor service that cleans up orphaned or dangling data files needs to be able to list all valid files in a table. Along with thesnapshots
table that has manifest list locations,all_manifests
andall_data_files
enable listing all data and metadata files referenced by a table.We use the
all_entries
table to detect the last modified time of partitions. This requires knowing when a file was appended or overwritten and requires ignoring later rewrites: