Upcasting and Downcasting inconsistencies with PyArrow Schema #791

@sungwy

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

schema_to_pyarrow converts BinaryType to pa.large_binary(). As a result, the Arrow table schema produced by a data scan differs between two cases:

  1. when there is no data in the table, so the schema comes from schema_to_pyarrow (pa.large_binary())
  2. when there is data, so the table is read using the physical_schema of the file fragment (pa.binary())

Related PR: #409

The implication of this bug is that a pa.Table read from the same Iceberg table may have a different schema depending on whether the table scan returns any data.

More importantly, if one file in a table scan is empty and another has data, the two resulting Arrow tables have inconsistent schemas, and pa.concat_tables(tables) raises an error when we try to combine them.
