Skip to content

[feat] push down schema casting to the record batch level #1049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kevinjqliu opened this issue Aug 13, 2024 · 1 comment · Fixed by #1669
Closed

[feat] push down schema casting to the record batch level #1049

kevinjqliu opened this issue Aug 13, 2024 · 1 comment · Fixed by #1669

Comments

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Aug 13, 2024

Feature Request / Improvement

# With PyArrow 16.0.0 there is an issue with casting record-batches:
# https://github.com/apache/arrow/issues/41884
# https://github.com/apache/arrow/issues/43183
# Would be good to remove this later on
schema=_pyarrow_schema_ensure_large_types(physical_schema)
if use_large_types
else (_pyarrow_schema_ensure_small_types(physical_schema)),

Related to #929

Requires Arrow 17 (possibly 18?)

Fokko added a commit to Fokko/iceberg-python that referenced this issue Feb 16, 2025
When reading a Parquet file using PyArrow, there is some metadata
stored in the Parquet file to either make it a large type (eg
`large_string`, or a normal type (`string`). The difference is that
the large types use a 64 bit offset to encode their arrays.
This is not always needed, and we can could first check all the
in the types of which it is stored, and let PyArrow decide here:

https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/io/pyarrow.py#L1579

In PyArrow today we just bump everything to a large type, which
might lead to additional memory consumption because it allocates
a int64 array to allocate the offsets, instead of an int32.

I thought we would be good to go for this now with the new lower
bound of PyArrow to 17. But, it looks like we still have to wait
for Arrow 18 to fix the issue with the `date` types:

apache/arrow#43183

Fixes: apache#1049
Copy link

github-actions bot commented Mar 1, 2025

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant