Add all_data_files, all_manifests, and all_entries metadata tables #805


Merged
rdblue merged 2 commits into apache:master on Feb 17, 2020

Conversation

@rdblue (Contributor) commented Feb 16, 2020

This adds 3 new metadata tables and tests:

  • all_data_files lists all data files in a table that are accessible from any valid (not expired) snapshot
  • all_entries lists all manifest entries in a table that are accessible from any valid snapshot
  • all_manifests lists all manifest files in a table that are accessible from any valid snapshot

These tables may contain duplicate rows. Deduplication can't be done through the current scan interface unless all of the work is done during scan planning on a single node. Duplicates are the trade-off for being able to process the metadata in parallel for large tables.
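Where the duplicates matter, they can be removed at query time instead. A minimal sketch, assuming the file_path column from the data files schema:

SELECT DISTINCT file_path
FROM db.table.all_data_files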

Use cases

We recently added the all_data_files and all_manifests tables to enable building services that manage data files. For example, a janitor service that cleans up orphaned or dangling data files needs to be able to list all valid files in a table. Along with the snapshots table that has manifest list locations, all_manifests and all_data_files enable listing all data and metadata files referenced by a table.
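As a sketch, a janitor service could list every file reachable from valid snapshots with a union over the three tables (assuming the manifest_list column on the snapshots table and the path column on all_manifests):

SELECT manifest_list AS path FROM db.table.snapshots
UNION
SELECT path FROM db.table.all_manifests
UNION
SELECT file_path AS path FROM db.table.all_data_files

The UNION also removes the duplicate rows noted above.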

We use the all_entries table to detect the last modified time of partitions. This requires knowing when a file was appended or overwritten, while ignoring later rewrites. In the query below, e.status = 1 selects entries whose status is ADDED (0 is EXISTING, 2 is DELETED):

SELECT
    max(s.committed_at) as last_updated_at,
    e.data_file.partition.*
FROM db.table.all_entries e
JOIN db.table.snapshots s
  ON e.snapshot_id = s.snapshot_id
WHERE e.status = 1 AND s.operation IN ('append', 'overwrite')
GROUP BY e.data_file.partition

@rdblue requested a review from aokolnychyi on February 16, 2020 23:39
@rdblue force-pushed the add-new-metadata-tables branch from 843e3d8 to b446de9 on February 17, 2020 01:00
@danielcweeks (Contributor) left a comment:

Only a couple of small nits/comments that apply across all the metadata table implementations. It seems like an easy fix to expose the ability to configure split sizes, which may be an issue in some rare cases (though I wouldn't expect manifest lists/files to typically exceed the default 32 MB).

+1

Schema schema = new Schema(DataFile.getType(table.spec().partitionType()).fields());
if (table.spec().fields().size() < 1) {
  // avoid returning an empty struct, which is not always supported. instead, drop the partition field (id 102)
  return TypeUtil.selectNot(schema, Sets.newHashSet(102));
}
Contributor:

This may be a bit out of scope, but shouldn't we have a better way to represent reserved field ids than magic numbers?

rdblue (Author):

We can look up the number in the schema using schema.findField("partition").fieldId(), but that's really just looking up a constant value using another constant, the field name "partition". I thought it was easier to use a comment to explain what's happening than to do the lookup. Maybe there's an alternative I didn't think of. Any ideas?
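For reference, a minimal sketch of the lookup-based alternative; it behaves the same as the hard-coded version, just resolving the id by name:

// resolve the reserved partition field id by name instead of hard-coding 102
int partitionFieldId = schema.findField("partition").fieldId();
return TypeUtil.selectNot(schema, Sets.newHashSet(partitionFieldId));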


@Override
public String location() {
  return table.currentSnapshot().manifestListLocation();
}
Contributor:

Assumes manifest list location?

rdblue (Author):

Yes, but it's okay if this is null.

Contributor:

@rdblue, a quick question: why do we use manifestListLocation for the AllDataFiles, AllEntriesFiles, DataFiles, and ManifestEntries tables, and ops.current().file().location() for the others?

There is one rare case where this might fail: our Reader in Spark uses table.location() to obtain a file system object, and if the manifest list location is null, it will fail.

Contributor:

The check was added recently with locality for HDFS.

rdblue (Author):

For location, I suspect it's just a copy/paste error. I was much more careful setting locations in the read tasks than for the metadata tables. We can clean that up.

Also, I forgot about the HDFS check. I think we should fix the cases where this may be null and return the table location instead. We should also make sure the locality logic doesn't cause a failure; it should only log a warning, since locality is just an optimization.
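A minimal sketch of that null fix, assuming a fallback to the table's root location when there is no current snapshot:

@Override
public String location() {
  // fall back to the table location so callers can still derive a file system
  Snapshot current = table.currentSnapshot();
  return current != null ? current.manifestListLocation() : table.location();
}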


@Override
protected long targetSplitSize(TableOperations ops) {
  return TARGET_SPLIT_SIZE;
}
Contributor:

Seems like something we should map to a table/read property with a default, as opposed to hard-coding it.

rdblue (Author):

I think we can add this later if necessary. I'd like to keep these simple and add features as we go.

Contributor:

I believe it generally makes sense to configure the split size for metadata tables. In our tables, it is not uncommon to see 4-6 MB manifest files. If we have a reasonable cluster and allow split sizes to be 16 MB, metadata queries can be 2 times faster.

I've created #817 so that we can discuss it.
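One possible shape for a property-backed split size, sketched with an illustrative property name (the real name and default are up for discussion in #817):

@Override
protected long targetSplitSize(TableOperations ops) {
  // hypothetical read property; falls back to the current hard-coded default
  String value = ops.current().properties().get("read.split.metadata-target-size");
  return value != null ? Long.parseLong(value) : TARGET_SPLIT_SIZE;
}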

@rdblue (Author) commented Feb 17, 2020

Thanks for the review, @danielcweeks! I'm merging this now that tests are passing after resolving the conflict.

@rdblue merged commit 65095c1 into apache:master on Feb 17, 2020

static CloseableIterable<ManifestFile> allManifestFiles(List<Snapshot> snapshots) {
Contributor:

There are a couple of places where we define static methods in one metadata table class and call them from others. It seems we could move some of those into the parent BaseMetadataTable.
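A sketch of that refactor, hoisting the helper into the shared parent (the body assumes Snapshot.manifests() returns the snapshot's manifest files, as in the current API):

abstract class BaseMetadataTable implements Table {
  // shared by the all_* table scans: concatenate each valid snapshot's manifests
  static CloseableIterable<ManifestFile> allManifestFiles(List<Snapshot> snapshots) {
    return CloseableIterable.concat(Iterables.transform(
        snapshots, snapshot -> CloseableIterable.withNoopClose(snapshot.manifests())));
  }
}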
