
feat: Infer partition values from bounds #1079


Merged
merged 11 commits into apache:main
Apr 8, 2025

Conversation

jonathanc-n
Contributor

@jonathanc-n jonathanc-n commented Mar 13, 2025

Which issue does this PR close?

What changes are included in this PR?

Added API for creating partition struct from statistics

Are these changes tested?

Tests will be added in a follow-up PR that integrates this with the add_parquet_file API.

@jonathanc-n jonathanc-n changed the title feat: Infer partition values statistics feat: Infer partition values from statistics Mar 13, 2025
));
}

if lower != upper {

CMIIW, it looks like lower and upper can be different while their partitions are the same, so shouldn't we check transform(lower) != transform(upper)? iceberg-python has the same logic; should we fix it? cc @kevinjqliu @Fokko


I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of all rows are the same.


> I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of all rows are the same.

This is interesting. The check here restricts the appended data file to have a single value for the partition source column. But in the spec, a data file only needs to guarantee that the partition value of the partition column is the same within that file. E.g. for year(ts), 2015-10-13 and 2015-11-13 can coexist in a single data file, but under this restriction we could not append a data file containing these two rows, right?
I'm not sure whether it's worth it, but I think there are two ways to avoid this restriction:

  1. Scan the whole data file to compute the partition values and make sure they are all the same.
  2. For partition transforms that preserve the original ordering (I'm not sure whether this description is accurate; e.g. year, month), transform(lower) == transform(upper) does mean the transformed results of all rows are the same.
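The second option can be illustrated with a small sketch. This is a toy model, not the iceberg-rust API: `year_transform` and `same_partition_by_bounds` are hypothetical names, and dates are modeled as `(year, month)` tuples. The point is only that for a monotonic transform, equal transformed bounds imply every row in between maps to the same partition value.

```rust
/// Toy order-preserving transform: map a (year, month) date to its year.
fn year_transform(date: (i32, u32)) -> i32 {
    date.0
}

/// Because `year_transform` is monotonic, transform(lower) == transform(upper)
/// implies transform(x) is identical for every x in the range [lower, upper],
/// so the bounds alone are enough to decide whether one partition covers the file.
fn same_partition_by_bounds(lower: (i32, u32), upper: (i32, u32)) -> bool {
    year_transform(lower) == year_transform(upper)
}
```

Under this sketch, a file with bounds 2015-10 and 2015-11 passes the relaxed check, while bounds that cross a year boundary do not.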


I agree that if the transform preserves order, we can relax the check.


@liurenjie1024 liurenjie1024 left a comment


Hi @jonathanc-n, I'm quite confused about this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate; for example, a long string may be truncated. If you want to use them when appending parquet files to a table transaction, you need to read the partition source columns back and recalculate them.

@ZENOTME
Contributor

ZENOTME commented Apr 2, 2025

> Hi @jonathanc-n, I'm quite confused about this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate; for example, a long string may be truncated. If you want to use them when appending parquet files to a table transaction, you need to read the partition source columns back and recalculate them.

I think this implementation is adapted from pyiceberg, see: https://github.com/apache/iceberg-python/blob/4d4714a46241d0d89519a2a605dbce27b713a60e/pyiceberg/io/pyarrow.py#L2236. It uses the lower bound and upper bound to compute the partition. Here the statistics (lower bound, upper bound) are generated when reading the parquet file, so I think we can guarantee that they are valid and accurate. 🤔

@jonathanc-n
Contributor Author

I think the function name is misleading; I will change that. We are passing in the lower and upper bounds computed from the original parquet file read during parquet_to_data_file_builder.
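The inference being discussed can be sketched as follows. This is a hypothetical simplification, not the PR's actual code: `partition_value` is an invented name, bounds are modeled as `i64` values keyed by column id, and the bounds are assumed exact because they were computed while reading the parquet file. If lower == upper for the source column, every row holds that value, so it is the partition value; otherwise a single value cannot be inferred without scanning the file.

```rust
use std::collections::HashMap;

// Hypothetical sketch: infer a single partition value for one source column
// from exact per-column min/max bounds computed at parquet read time.
fn partition_value(
    source_id: i32,
    lower_bounds: &HashMap<i32, i64>,
    upper_bounds: &HashMap<i32, i64>,
) -> Result<i64, String> {
    match (lower_bounds.get(&source_id), upper_bounds.get(&source_id)) {
        // Every row holds the same value, so it is the partition value.
        (Some(lower), Some(upper)) if lower == upper => Ok(*lower),
        // Differing bounds: cannot infer one value without scanning the file.
        (Some(_), Some(_)) => Err(format!(
            "cannot infer partition value for field {source_id}: bounds differ"
        )),
        _ => Err(format!("missing bounds for field {source_id}")),
    }
}
```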

@jonathanc-n jonathanc-n changed the title feat: Infer partition values from statistics feat: Infer partition values from bounds Apr 2, 2025

@liurenjie1024 liurenjie1024 left a comment


Thanks @jonathanc-n for this PR, and @ZENOTME for the review; just one minor bug.


for field in table_spec.fields() {
    if let (Some(lower), Some(upper)) = (
        lower_bounds.get(&field.field_id),

If we are checking source value, this should be source_id?
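The distinction behind this comment can be sketched with toy types (hypothetical names, not the iceberg-rust structs): in an Iceberg partition spec, each partition field carries its own `field_id` plus the `source_id` of the data column it transforms, and column bounds are keyed by the source column id, so the lookup must use `source_id`.

```rust
use std::collections::HashMap;

// Toy model of a partition spec field: it has its own id (`field_id`)
// and references the data column it is derived from (`source_id`).
#[allow(dead_code)]
struct PartitionField {
    field_id: i32,  // id of the partition field itself
    source_id: i32, // id of the source data column
}

// Bounds maps are keyed by the source column's id, so looking up by
// `field_id` would silently find nothing; `source_id` is the correct key.
fn bound_for(field: &PartitionField, bounds: &HashMap<i32, i64>) -> Option<i64> {
    bounds.get(&field.source_id).copied()
}
```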


@jonathanc-n
Contributor Author

@liurenjie1024 Good catch, should be good now. Thank you for the review!


@liurenjie1024 liurenjie1024 left a comment


Thanks @jonathanc-n for this PR, LGTM!

@liurenjie1024 liurenjie1024 merged commit e3ef617 into apache:main Apr 8, 2025
17 checks passed
@jonathanc-n jonathanc-n deleted the add-to-partitioned branch April 8, 2025 17:15