After #349, we support appending DataFile. However, I found that some checks may be missing: schema evolution or partition evolution may happen in the table after we generate the DataFile and before we append it, which makes the information in the DataFile invalid. For example, the partition value in the DataFile becomes invalid when partition evolution happens, and lower_bound/upper_bound become invalid when schema evolution happens. So we need to detect the case where a DataFile is incompatible with the table.
For partition evolution, there are two ways to detect this:
- Ensure that the partition value schema matches the existing partition spec in terms of type. This is what we do now, but there are cases it can't detect, e.g. a partition spec of type <p1: int, p2: int> reordered to <p2: int, p1: int>.
- Ensure that the partition value schema matches the existing partition spec in terms of field name or field ID.
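To make the difference between the two approaches concrete, here is a minimal sketch. `PartitionField`, `field`, `matches_by_type`, and `matches_by_field_id` are hypothetical, simplified stand-ins for illustration only, not the actual iceberg-rust types or API.

```rust
/// Hypothetical, simplified stand-in for one field of a partition spec
/// (not the real iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
struct PartitionField {
    field_id: i32,
    name: String,
    type_name: String, // e.g. "int"
}

/// Small helper to build a field for the example.
fn field(field_id: i32, name: &str, type_name: &str) -> PartitionField {
    PartitionField {
        field_id,
        name: name.to_string(),
        type_name: type_name.to_string(),
    }
}

/// Type-only check: compares types position by position.
/// A reorder of same-typed fields passes this check unnoticed.
fn matches_by_type(file: &[PartitionField], spec: &[PartitionField]) -> bool {
    file.len() == spec.len()
        && file
            .iter()
            .zip(spec)
            .all(|(f, s)| f.type_name == s.type_name)
}

/// ID-aware check: each position must agree on field ID and type,
/// so a reorder of same-typed fields is rejected.
fn matches_by_field_id(file: &[PartitionField], spec: &[PartitionField]) -> bool {
    file.len() == spec.len()
        && file
            .iter()
            .zip(spec)
            .all(|(f, s)| f.field_id == s.field_id && f.type_name == s.type_name)
}
```

With a file written against <p1: int, p2: int> and a table spec reordered to <p2: int, p1: int>, `matches_by_type` accepts the file while `matches_by_field_id` rejects it.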
For schema evolution:
- It may also lead to partition evolution, in which case the detection method for partition values is the same as described above.
- Check whether lower_bound/upper_bound match the current schema using the field ID.
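A rough sketch of what a field-ID based bound check could look like. `Bound` and `validate_bounds` are hypothetical names, and the schema is modeled as a plain map from field ID to type name. This sketch assumes any type difference is a mismatch; a real check would likely need to account for allowed type promotions.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified bound value (not the real iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
enum Bound {
    Int(i32),
    Long(i64),
    Text(String),
}

impl Bound {
    /// Name of the primitive type this bound was written with.
    fn type_name(&self) -> &'static str {
        match self {
            Bound::Int(_) => "int",
            Bound::Long(_) => "long",
            Bound::Text(_) => "string",
        }
    }
}

/// Checks that every bound, keyed by field ID, refers to a field that
/// still exists in the current schema and has the same type.
/// Assumption: `schema` maps field ID to type name.
fn validate_bounds(
    bounds: &HashMap<i32, Bound>,
    schema: &HashMap<i32, &str>,
) -> Result<(), String> {
    for (field_id, bound) in bounds {
        match schema.get(field_id) {
            // The field was dropped (or never existed) in the current schema.
            None => {
                return Err(format!(
                    "bound refers to field id {field_id}, which is not in the current schema"
                ))
            }
            // The field exists but its type differs from the bound's type.
            Some(expected) if *expected != bound.type_name() => {
                return Err(format!(
                    "bound for field id {field_id} has type {}, but the schema expects {expected}",
                    bound.type_name()
                ))
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

The same function would be applied to both the lower_bound and upper_bound maps of a DataFile before the append is committed.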
Based on the above analysis, we need to make the following fixes:
- The partition in DataFile should carry type information to facilitate validation, e.g. the field name and field ID.
- Append operations need to add validation checks for schema evolution: lower_bound and upper_bound.
I'm not sure whether my understanding is correct, so please correct me if something is wrong. cc @Fokko @liurenjie1024 @Xuanwo