
feat: Infer partition values from bounds #1079


Merged
merged 11 commits into apache:main
Apr 8, 2025

Conversation

jonathanc-n
Contributor

@jonathanc-n jonathanc-n commented Mar 13, 2025

Which issue does this PR close?

What changes are included in this PR?

Added API for creating partition struct from statistics

Are these changes tested?

Tests will be added in a follow-up PR that integrates this with the add_parquet_file API.

@jonathanc-n jonathanc-n changed the title feat: Infer partition values statistics feat: Infer partition values from statistics Mar 13, 2025
));
}

if lower != upper {

CMIIW, it looks like lower and upper can be different while their partitions are the same, so shouldn't we check transform(lower) != transform(upper)? iceberg-python has the same logic; should we fix it? cc @kevinjqliu @Fokko


I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of all rows are the same.


> I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of all rows are the same.

This is interesting. The check here restricts the appended data file to have a single value for the partition source column. But in the spec, a data file only needs to guarantee that the partition value of the partition column is the same within that file. E.g. for year(ts), 2015-10-13 and 2015-11-13 can coexist in a single data file, but under this restriction we could not append a data file containing these two rows, right?
I'm not sure whether it's worth it, but I think there are two ways to avoid this restriction:

  1. Scan the whole data file to compute the partition values and make sure they are all the same.
  2. For partition transforms that preserve the original ordering (I'm not sure whether this description is accurate; e.g. year, month), transform(lower) == transform(upper) does mean the transformed results of all rows are the same.
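The second option can be illustrated with a small sketch. This is a toy model, not the iceberg-rust API: `year_transform` and `same_partition_by_bounds` are hypothetical names, and dates are modeled as `(year, month)` tuples. The point is only that for a monotonic transform, equal transformed bounds imply every row in between maps to the same partition value.

```rust
/// Toy order-preserving transform: map a (year, month) date to its year.
fn year_transform(date: (i32, u32)) -> i32 {
    date.0
}

/// Because `year_transform` is monotonic, transform(lower) == transform(upper)
/// implies transform(x) is identical for every x in the range [lower, upper],
/// so the bounds alone are enough to decide whether one partition covers the file.
fn same_partition_by_bounds(lower: (i32, u32), upper: (i32, u32)) -> bool {
    year_transform(lower) == year_transform(upper)
}
```

Under this sketch, a file with bounds 2015-10 and 2015-11 passes the relaxed check, while bounds that cross a year boundary do not.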


I agree that if the transform preserves order, we can relax the check.


@liurenjie1024 liurenjie1024 left a comment


Hi @jonathanc-n, I'm quite confused about this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate; for example, a long string may be truncated. If you want to use them when appending parquet files to a table transaction, you need to read the partition source columns back and recalculate them.

@ZENOTME
Contributor

ZENOTME commented Apr 2, 2025

> Hi @jonathanc-n, I'm quite confused about this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate; for example, a long string may be truncated. If you want to use them when appending parquet files to a table transaction, you need to read the partition source columns back and recalculate them.

I think this implementation is adapted from pyiceberg, see: https://github.com/apache/iceberg-python/blob/4d4714a46241d0d89519a2a605dbce27b713a60e/pyiceberg/io/pyarrow.py#L2236. It uses the lower bound and upper bound to compute the partition. Here the statistics (lower bound, upper bound) are generated when reading the parquet file, so I think we can guarantee that they are valid and accurate. 🤔

@jonathanc-n
Contributor Author

I think the function name is misleading; I will change that. We are passing in the lower and upper bounds computed from the original parquet file read during parquet_to_data_file_builder.
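The inference being discussed can be sketched as follows. This is a hypothetical simplification, not the PR's actual code: `partition_value` is an invented name, bounds are modeled as `i64` values keyed by column id, and the bounds are assumed exact because they were computed while reading the parquet file. If lower == upper for the source column, every row holds that value, so it is the partition value; otherwise a single value cannot be inferred without scanning the file.

```rust
use std::collections::HashMap;

// Hypothetical sketch: infer a single partition value for one source column
// from exact per-column min/max bounds computed at parquet read time.
fn partition_value(
    source_id: i32,
    lower_bounds: &HashMap<i32, i64>,
    upper_bounds: &HashMap<i32, i64>,
) -> Result<i64, String> {
    match (lower_bounds.get(&source_id), upper_bounds.get(&source_id)) {
        // Every row holds the same value, so it is the partition value.
        (Some(lower), Some(upper)) if lower == upper => Ok(*lower),
        // Differing bounds: cannot infer one value without scanning the file.
        (Some(_), Some(_)) => Err(format!(
            "cannot infer partition value for field {source_id}: bounds differ"
        )),
        _ => Err(format!("missing bounds for field {source_id}")),
    }
}
```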

@jonathanc-n jonathanc-n changed the title feat: Infer partition values from statistics feat: Infer partition values from bounds Apr 2, 2025

@liurenjie1024 liurenjie1024 left a comment


Thanks @jonathanc-n for this PR, and @ZENOTME for the review; just one minor bug.


for field in table_spec.fields() {
    if let (Some(lower), Some(upper)) = (
        lower_bounds.get(&field.field_id),

If we are checking source value, this should be source_id?
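The distinction behind this comment can be sketched with toy types (hypothetical names, not the iceberg-rust structs): in an Iceberg partition spec, each partition field carries its own `field_id` plus the `source_id` of the data column it transforms, and column bounds are keyed by the source column id, so the lookup must use `source_id`.

```rust
use std::collections::HashMap;

// Toy model of a partition spec field: it has its own id (`field_id`)
// and references the data column it is derived from (`source_id`).
#[allow(dead_code)]
struct PartitionField {
    field_id: i32,  // id of the partition field itself
    source_id: i32, // id of the source data column
}

// Bounds maps are keyed by the source column's id, so looking up by
// `field_id` would silently find nothing; `source_id` is the correct key.
fn bound_for(field: &PartitionField, bounds: &HashMap<i32, i64>) -> Option<i64> {
    bounds.get(&field.source_id).copied()
}
```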


@jonathanc-n
Contributor Author

@liurenjie1024 Good catch, should be good now. Thank you for the review!


@liurenjie1024 liurenjie1024 left a comment


Thanks @jonathanc-n for this PR, LGTM!

@liurenjie1024 liurenjie1024 merged commit e3ef617 into apache:main Apr 8, 2025
17 checks passed
@jonathanc-n jonathanc-n deleted the add-to-partitioned branch April 8, 2025 17:15