Describe the enhancement requested
For PyIceberg, concatenation of tables was recently added: #36846. To add new fields, I concatenate the requested schema with the data that was loaded. However, I'm now hitting the next barrier: I'm unable to project the schemas of nested structs.
A bit of context: for the top-level schema this is not an issue, because we can select the columns we need when reading the table, but that does not allow selecting nested columns.
Selecting a subset of the top-level fields with Table.cast fails:
➜ Desktop python3
Python 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>>
>>> current_schema = pa.schema([pa.field("x", pa.float32()), pa.field("y", pa.float32())])
>>> tbl = pa.Table.from_pylist(
...     [
...         {"x": 52.371807, "y": 4.896029},
...         {"x": 52.387386, "y": 4.646219},
...         {"x": 52.078663, "y": 4.288788},
...     ],
...     schema=current_schema,
... )
>>> schema_without_y = pa.schema(
...     [
...         pa.field("x", pa.float32()),
...     ]
... )
>>> tbl.cast(schema_without_y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']
The same cast fails when dropping a field inside a nested struct:
➜ Desktop python3
Python 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>>
>>> current_schema = pa.schema(
...     [
...         pa.field(
...             "location",
...             pa.struct([pa.field("x", pa.float32()), pa.field("y", pa.float32())]),
...         )
...     ]
... )
>>>
>>> tbl = pa.Table.from_pylist(
...     [
...         {"location": {"x": 52.371807, "y": 4.896029}},
...         {"location": {"x": 52.387386, "y": 4.646219}},
...         {"location": {"x": 52.078663, "y": 4.288788}},
...     ],
...     schema=current_schema,
... )
>>> schema_without_y = pa.schema(
...     [
...         pa.field(
...             "location",
...             pa.struct(
...                 [
...                     pa.field("x", pa.float32()),
...                 ]
...             ),
...         )
...     ]
... )
>>> tbl.cast(schema_without_y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']
Any thoughts on adding this? Or can we achieve this in another way?
Component(s)
Python