
Reported and actual arrow schema of the table can be different #813

Open
@gruuya

Description


This is related to #783.

Namely, what happens is:

  • I use pyiceberg to create an Iceberg table from a Parquet file.
  • The Parquet file has type hints for e.g. DataType::Int16 (required int32 c1 (INTEGER(16,true)) = 1;).
  • Thanks to #783 (Discussion: Support conversion of Arrow Int8 and Int16 to PrimitiveType::Int) we now upcast that to the native 32-bit Int type and can read it.
  • This is also the type returned in e.g. TableProvider::schema.
  • However, the actual type in the arrow record batches that are read (inferred from the Parquet hint) is DataType::Int16, leading to a mismatch between the reported and the actual schema.
  • This in turn makes a DataFusion query such as SELECT c1 FROM t WHERE c1 <= 2 crash with Invalid comparison operation: Int16 <= Int32.
  • Ultimately, the schema mismatch tricks one of the logical optimizers into thinking that casting the right-hand side (i.e. the literal 2) to DataType::Int32 (taken from the reported schema) will make the comparison valid.
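As a toy illustration of the failure mode (plain Python; the string types, `plan_filter`, and `execute_filter` are made-up stand-ins, not the real DataFusion APIs): the optimizer plans the cast against the reported schema, while execution sees the batch's actual type.

```python
# Schema reported by the table provider: Int16 was upcast to Int32 (per #783).
reported_schema = {"c1": "Int32"}

# Schema of the record batches actually read: the Parquet INTEGER(16,true)
# hint makes the reader produce Int16 again.
actual_schema = {"c1": "Int16"}

def plan_filter(reported, column):
    """Hypothetical optimizer step: cast the literal to the column's
    type *as reported*, assuming that makes the comparison type-safe."""
    return (column, "<=", reported[column])  # literal 2 cast to Int32

def execute_filter(actual, plan):
    """Hypothetical execution step: the comparison kernel sees the
    batch's real type, which no longer matches the planned literal type."""
    column, op, literal_type = plan
    lhs, rhs = actual[column], literal_type
    if lhs != rhs:
        raise TypeError(f"Invalid comparison operation: {lhs} {op} {rhs}")
    return True

plan = plan_filter(reported_schema, "c1")
try:
    execute_filter(actual_schema, plan)
except TypeError as e:
    print(e)  # Invalid comparison operation: Int16 <= Int32
```

The sketch only models the two schemas diverging; the actual fix would presumably make the reader and the reported schema agree on one type.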
