Allow projection of schemas/structs #38615

Open
@Fokko

Description

Describe the enhancement requested

For PyIceberg, concatenation of tables was recently added: #36846. To add new fields, I concat the requested schema with the data that was loaded. However, I'm now hitting the next barrier: I'm unable to project the schemas of nested structs.

A bit of context: for the top-level schema this is not an issue, because we can select the columns we need when reading in the table. However, that does not allow selecting nested columns.

Selecting a subset:

Desktop python3
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> current_schema = pa.schema([pa.field("x", pa.float32()), pa.field("y", pa.float32())])
>>> tbl = pa.Table.from_pylist(
...     [
...         {"x": 52.371807, "y": 4.896029},
...         {"x": 52.387386, "y": 4.646219},
...         {"x": 52.078663, "y": 4.288788},
...     ],
...     schema=current_schema,
... )
>>> schema_only_x = pa.schema(
...     [
...         pa.field("x", pa.float32()),
...     ]
... )
>>> tbl.cast(schema_only_x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']

Or in a nested struct:

Desktop python3
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> current_schema = pa.schema(
...     [
...         pa.field(
...             "location",
...             pa.struct([pa.field("x", pa.float32()), pa.field("y", pa.float32())]),
...         )
...     ]
... )
>>> 
>>> tbl = pa.Table.from_pylist(
...     [
...         {"location": {"x": 52.371807, "y": 4.896029}},
...         {"location": {"x": 52.387386, "y": 4.646219}},
...         {"location": {"x": 52.078663, "y": 4.288788}},
...     ],
...     schema=current_schema,
... )
>>> schema_without_y = pa.schema(
...     [
...         pa.field(
...             "location",
...             pa.struct(
...                 [
...                     pa.field("x", pa.float32()),
...                 ]
...             ),
...         )
...     ]
... )
>>> tbl.cast(schema_without_y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']

Any thoughts on adding this? Or can we achieve this in another way?

Component(s)

Python
