validation for static overwrite with filter #4
Section 1: Background
When doing a static overwrite, we can choose to overwrite the full table or only some partitions of it.
The Spark SQL counterpart of an Iceberg static overwrite of the full table is:
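Presumably something along these lines (sketched via PySpark; the catalog and table names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Full-table static overwrite: every existing row is replaced by the query result.
spark.sql("""
    INSERT OVERWRITE my_catalog.db.tbl
    SELECT * FROM my_catalog.db.tbl_staging
""")
```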
And the Spark SQL counterpart of a static overwrite of specified partitions is:
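Again a sketch with placeholder names; the PARTITION clause pins the partition being replaced while the query supplies the remaining columns:

```python
# Static overwrite of a single partition: only rows with foo='hello' are replaced.
spark.sql("""
    INSERT OVERWRITE my_catalog.db.tbl
    PARTITION (foo = 'hello')
    SELECT baz, ts FROM my_catalog.db.tbl_staging
""")
```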
Section 2: Goal of the Pull Request
When we overwrite the table, we can provide an expression as overwrite_filter to cover these two cases. The filter expression must conform to certain rules: it can only reference partition columns that do not use hidden partitioning, and its fields have to relate to the input Arrow table in a specific way so that the new data gets its partition values auto-filled in accordance with the removed partitions. This pull request implements the validation logic for these rules, which are listed in Section 3.
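In PyIceberg terms, the intended usage looks roughly like the sketch below (catalog name, table name, columns, and the exact `overwrite` signature are illustrative assumptions):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")
tbl = catalog.load_table("db.tbl")   # assume the table is partitioned by identity(foo)

# The Arrow table carries every column except the filtered one; the new files get
# foo='hello' auto-filled from the filter itself.
df = pa.table({"baz": ["a", "b"]})

# Replace only the foo='hello' partition; other partitions stay untouched.
tbl.overwrite(df, overwrite_filter=EqualTo("foo", "hello"))
```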
Section 3: Rules and Test Cases
Rule 1: The expression may only use IsNull or EqualTo as building blocks, concatenated by And.
Tests:
test__validate_static_overwrite_filter_expr_type
parametrize 1-8
Rule 2: The building block predicates (IsNull and EqualTo) should not have conflicting values.
Tests:
test__validate_static_overwrite_filter_expr_type
parametrize 9-11
Rule 3: The terms (fields) must refer to existing fields in the Iceberg schema, and any literal in a predicate must match the Iceberg field type; in other words, the expression must bind successfully against the table schema.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_on_non_schema_fields_in_filter
test__bind_and_validate_static_overwrite_filter_predicate_fails_to_bind_due_to_incompatible_predicate_value
Rule 4: If the expression references a field that is required in the Iceberg schema, that field cannot appear as IsNull in the expression.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_to_bind_due_to_non_nullable
Rule 5: The fields in the expression must be partition columns.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_on_non_part_fields_in_filter
Rule 6: The Iceberg table fields referenced in the expression cannot use hidden partitioning; partition fields not referenced in the filter may.
Tests:
test__bind_and_validate_static_overwrite_filter_predicate_fails_on_non_identity_transorm_filter
test__bind_and_validate_static_overwrite_filter_predicate_succeeds_on_an_identity_transform_field_although_table_has_other_hidden_partition_fields
Rule 7: Three-way relationship between the filter, the Arrow table, and the Iceberg table schema

The fields in the filter should not appear in the input Arrow table; however, when these fields are removed from the Iceberg schema, the remaining schema should match the Arrow table schema exactly in terms of field names, nullability, and type.
Tests:
case 2:
test__check_schema_with_filter_fails_on_missing_field
case 3:
test__check_schema_with_filter_fails_due_to_filter_and_dataframe_holding_same_field
case 4:
test__check_schema_with_filter_fails_on_nullability_mismatch
case 4:
test__check_schema_with_filter_fails_on_type_mismatch
case 5:
test__check_schema_with_filter_succeed
case 5:
test__check_schema_with_filter_succeed_on_pyarrow_table_with_random_column_order
for this change: https://github.com/jqin61/iceberg-python/pull/4/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1739
Section 4: Flow and Pseudocode Logic
Entire Flow:

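As a rough outline (helper names are derived from the test names above and are illustrative only, not the actual implementation):

```python
def _static_overwrite_with_filter(tbl, df, overwrite_filter):
    """Illustrative outline of the flow implied by the rules in Section 3."""
    # Rules 1-2: only And/EqualTo/IsNull building blocks, no conflicting values
    _validate_static_overwrite_filter_expr_type(overwrite_filter)
    # Rules 3-6: bind against the schema; required fields are not IsNull; fields
    # are identity-transformed partition columns
    bound = _bind_and_validate_static_overwrite_filter_predicate(tbl, overwrite_filter)
    # Rule 7: filter fields plus the Arrow schema cover the Iceberg schema exactly
    _check_schema_with_filter(tbl.schema(), df.schema, bound)
    # Finally: delete the data matched by the filter, then append df with the
    # filtered partition values auto-filled
```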
Rule 7 Elaboration:

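As a concrete illustration of the three-way check (field names and types below are made up for the example; the real check also compares nullability and reports mismatches field by field):

```python
import pyarrow as pa

# Hypothetical Iceberg schema, flattened to name -> arrow type for the sketch.
iceberg_fields = {"foo": pa.string(), "baz": pa.string(), "ts": pa.timestamp("us")}

# The overwrite filter pins foo, so foo must NOT appear in the Arrow table...
filter_fields = {"foo"}

# ...and the Arrow table must supply exactly the remaining fields.
arrow_schema = pa.schema([("baz", pa.string()), ("ts", pa.timestamp("us"))])

expected = {name: typ for name, typ in iceberg_fields.items() if name not in filter_fields}
actual = {field.name: field.type for field in arrow_schema}
assert expected == actual
```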
Section 5: Rule Necessity Justification - Spark Counterparts
To better understand these rules, let us look at their Spark static overwrite counterparts, i.e. how Spark crashes (or succeeds) in the corresponding situations. As a setup we use a small partitioned table with some sample data.
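A minimal sketch of such a setup (the catalog/table name, columns, and partition spec are assumptions, chosen so that every rule below can be exercised: a NOT NULL identity partition column `foo`, a hidden partition `days(ts)`, and a plain data column `baz`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE demo.db.sample (
        foo string NOT NULL,
        baz string,
        ts  timestamp)
    USING iceberg
    PARTITIONED BY (foo, days(ts))
""")

spark.sql("""
    INSERT INTO demo.db.sample VALUES
        ('hello', 'a', TIMESTAMP '2024-01-01 00:00:00'),
        ('world', 'b', TIMESTAMP '2024-01-02 00:00:00')
""")

spark.table("demo.db.sample").printSchema()   # the table schema
spark.table("demo.db.sample").show()          # the sample data
```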
Now let us go through the rules one by one.
Rule 1. The expression may only use IsNull or EqualTo as building blocks, concatenated by And.
For example:
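In PyIceberg expression form, this would presumably be built as:

```python
from pyiceberg.expressions import And, EqualTo, IsNull

And(EqualTo("foo", "hello"), And(IsNull("baz"), EqualTo("boo", "hello")))
```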
or
"foo = 'hello' AND (baz IS NULL AND boo = 'hello')"
Spark counterpart example:
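One way to provoke this in Spark (a sketch against the hypothetical `demo.db.sample` table above) is to attempt anything other than an equality in the static PARTITION clause:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (foo > 'hello')
    SELECT 'a' AS baz, TIMESTAMP '2024-01-01 00:00:00' AS ts
""")
```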
gives:
Other predicates such as 'in' and '!=', and other expressions such as 'Or', give similar errors.
Rule 2. The building block predicates (IsNull and EqualTo) should not have conflicting values.
This means that predicates with conflicting values on the same field, such as two different EqualTo values, or an EqualTo combined with an IsNull on that field, are not allowed. However, an exact duplicate of the same predicate is allowed and is simply deduplicated.
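In expression form, a sketch of the three situations (the field name is illustrative):

```python
from pyiceberg.expressions import And, EqualTo, IsNull

# Rejected: two different values for the same field
And(EqualTo("foo", "hello"), EqualTo("foo", "world"))
# Rejected: a concrete value and IS NULL for the same field
And(EqualTo("foo", "hello"), IsNull("foo"))
# Accepted: an exact duplicate, which the validation simply deduplicates
And(EqualTo("foo", "hello"), EqualTo("foo", "hello"))
```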
Spark counterpart example:
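For example (a sketch against the hypothetical setup), repeating the same partition column with different values:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (foo = 'hello', foo = 'world')
    SELECT 'a' AS baz, TIMESTAMP '2024-01-01 00:00:00' AS ts
""")
```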
gives:
Rule 3. The terms (fields) must refer to existing fields in the Iceberg schema, and any literal in a predicate must match the Iceberg field type; in other words, the expression must bind successfully against the table schema.
Spark counterpart example:
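For example (sketch), referencing a column that does not exist in the table:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (does_not_exist = 'hello')
    SELECT 'a' AS baz, TIMESTAMP '2024-01-01 00:00:00' AS ts
""")
```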
gives:
Rule 4. If the expression references a field that is required in the Iceberg schema, that field cannot appear as IsNull in the expression.
Spark counterpart example:
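For example (sketch), one way to attempt this is pinning the NOT NULL partition column `foo` to NULL:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (foo = null)
    SELECT 'a' AS baz, TIMESTAMP '2024-01-01 00:00:00' AS ts
""")
```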
gives:
Rule 5. The fields in the expression must be partition columns.
Spark counterpart example:
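For example (sketch), using the non-partition column `baz` in the PARTITION clause:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (baz = 'a')
    SELECT 'hello' AS foo, TIMESTAMP '2024-01-01 00:00:00' AS ts
""")
```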
gives:
Rule 6. The Iceberg table fields referenced in the expression cannot use hidden partitioning; partition fields not referenced in the filter may.
Spark counterpart example:
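For example (sketch), trying to pin the hidden `days(ts)` partition through its source column `ts`:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (ts = TIMESTAMP '2024-01-01 00:00:00')
    SELECT 'hello' AS foo, 'a' AS baz
""")
```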
gives:
However, if we specify the other partition column, which uses an identity transform, it works:
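A sketch of the working variant, pinning only the identity partition column `foo` and letting the hidden `days(ts)` partitions be derived from the `ts` values in the query:

```python
spark.sql("""
    INSERT OVERWRITE demo.db.sample
    PARTITION (foo = 'hello')
    SELECT 'a' AS baz, TIMESTAMP '2024-01-01 00:00:00' AS ts
""")
```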
Rule 7. Three-way relationship between the filter, the Arrow table, and the Iceberg table schema
Rule 7 Case 1:
gives:
Rule 7 Case 2:
gives:
Rule 7 Case 3:
gives:
Rule 7 Case 4:
gives:
Rule 7 Special Case
Please note: static overwrite with a filter includes a more complex scenario in which the filter covers only a subset of the partition columns and the Arrow table supplies the rest, so that the full partition keys are discovered dynamically for the partition fields not specified in the filter.
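A sketch of this scenario in PyIceberg terms (column names follow the hypothetical setup above; `tbl` is a loaded PyIceberg table and the `overwrite` signature is assumed):

```python
from datetime import datetime

import pyarrow as pa
from pyiceberg.expressions import EqualTo

# The filter fixes the identity partition foo='hello'; the ts values in the
# Arrow table determine which hidden days(ts) partitions get written.
df = pa.table({
    "baz": ["a", "b"],
    "ts": [datetime(2024, 1, 1), datetime(2024, 1, 2)],
})
tbl.overwrite(df, overwrite_filter=EqualTo("foo", "hello"))
```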
gives:

For the three-way comparison, we need to print the mismatches clearly; a sample of the Rule 7 printing:
https://github.com/jqin61/iceberg-python/pull/4/files#diff-f3df6bef7f6c6d2b05df9a09e0039ff4391ebb66725d41bb0f27f26bb2eb4047R1213