
Reported and actual arrow schema of the table can be different #813

Open
@gruuya

Description


This is related to #783.

Namely, what happens is:

  • I use pyiceberg to create an Iceberg table from a Parquet file.
  • The Parquet file has type hints for e.g. DataType::Int16 (required int32 c1 (INTEGER(16,true)) = 1;).
  • Thanks to #783 (Discussion: Support conversion of Arrow Int8 and Int16 to PrimitiveType::Int) we now upcast that to the native 32-bit Int type and can read it.
  • This is also the type returned in e.g. TableProvider::schema.
  • However, the actual type in the arrow record batches that are read (inferred from the Parquet hint) is DataType::Int16, leading to a mismatch between the reported and the actual schema.
  • This in turn makes a DataFusion query such as SELECT c1 FROM t WHERE c1 <= 2 crash with Invalid comparison operation: Int16 <= Int32.
  • Ultimately, the schema mismatch tricks one of the logical optimizers into thinking that casting the right-hand side (i.e. the literal 2) to DataType::Int32 (taken from the reported schema) will make the comparison valid.
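As a toy illustration of the failure mode (plain Python; the string types, `plan_filter`, and `execute_filter` are made-up stand-ins, not the real DataFusion APIs): the optimizer plans the cast against the reported schema, while execution sees the batch's actual type.

```python
# Schema reported by the table provider: Int16 was upcast to Int32 (per #783).
reported_schema = {"c1": "Int32"}

# Schema of the record batches actually read: the Parquet INTEGER(16,true)
# hint makes the reader produce Int16 again.
actual_schema = {"c1": "Int16"}

def plan_filter(reported, column):
    """Hypothetical optimizer step: cast the literal to the column's
    type *as reported*, assuming that makes the comparison type-safe."""
    return (column, "<=", reported[column])  # literal 2 cast to Int32

def execute_filter(actual, plan):
    """Hypothetical execution step: the comparison kernel sees the
    batch's real type, which no longer matches the planned literal type."""
    column, op, literal_type = plan
    lhs, rhs = actual[column], literal_type
    if lhs != rhs:
        raise TypeError(f"Invalid comparison operation: {lhs} {op} {rhs}")
    return True

plan = plan_filter(reported_schema, "c1")
try:
    execute_filter(actual_schema, plan)
except TypeError as e:
    print(e)  # Invalid comparison operation: Int16 <= Int32
```

The sketch only models the two schemas diverging; the actual fix would presumably make the reader and the reported schema agree on one type.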
