-
Notifications
You must be signed in to change notification settings - Fork 266
feat: Infer partition values from bounds #1079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
)); | ||
} | ||
|
||
if lower != upper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMIIW, it looks like the possible lower upper can be different, and their partitions are the same. So we need to check transform(lower) != transform(upper)
? iceberg-python has the same logic, should we fix it? cc @kevinjqliu @Fokko
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, transform(lower) == transform(upper)
doesn't mean the transformed result of each row are all same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, transform(lower) == transform(upper) doesn't mean the transformed result of each row are all same.
This is interesting. The check here restricts the appended data file to have the same value for partition column. But in spec, the data file only needs to guarantee that the partition value of partition column within single data file is same. e.g. for year(ts)
, 2015-10-13
, 2015-11-13
is ok to exist in single data file I think. But under this restriction, we could not append data file containing these two row, right?
I'm not sure whether worth it, I think there are two ways to avoid this restriction:
- Scan whole data file to compute the partition and make sure they are same.
- For partition transform, preserve original order properties(I'm not sure whether this description is accurate, e.g year, month),
transform(lower) == transform(upper)
means the transformed result of each row are all same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that if the transform preserves order, we can relax the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @jonathanc-n I'm quite confused about this pr, how can you infer partition value from statistics? First of all, statistics are optional, and they are maybe inaccurate. For example, long string may be truncated. If you want to use them in appending parquet files to table transaction, you need to read partition source columns back and recalculate them.
I think this implementation is refer from pyiceberg, see: https://github.com/apache/iceberg-python/blob/4d4714a46241d0d89519a2a605dbce27b713a60e/pyiceberg/io/pyarrow.py#L2236. It uses lower bound and upper bound to compute the partition. In here this statistics(lower bound, upper bound) is generate when read the parquet file, so we can guarantee that they are valid and accurate I think.🤔 |
I think the function name is misleading I will change that. We are passing in the lower and upper bounds computed from the original parquet file read during the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonathanc-n for this pr, and @ZENOTME for review, just one minor bug.
|
||
for field in table_spec.fields() { | ||
if let (Some(lower), Some(upper)) = ( | ||
lower_bounds.get(&field.field_id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are checking source value, this should be source_id?
)); | ||
} | ||
|
||
if lower != upper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, transform(lower) == transform(upper)
doesn't mean the transformed result of each row are all same.
@liurenjie1024 Good catch, should be good now. Thank you for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jonathanc-n for this pr, LGTM!
Which issue does this PR close?
What changes are included in this PR?
Added API for creating partition struct from statistics
Are these changes tested?
Will add tests after follow up pr for integrating it with the add_parquet_file api