Regression in 0.7.0 due to type coercion from "string" to "large_string" #1128
Comments
To summarize, given a table created with Expected: return pyarrow dataframe with Confirmed the above issue on |
The issue above was mentioned here: #986 (comment). On read, pyarrow will use the large type by default. This is controlled by a table property (courtesy of #986); see iceberg-python/pyiceberg/io/pyarrow.py, lines 1365 to 1371 in 9857107.
As a workaround, you can manually set the table property to force the read path to use the
Thanks @kevinjqliu. I can confirm that the workaround resolves the problem when using the latest main branch, but not v0.7.0 or v0.7.1. Setting I would be tempted to change the default value of A further improvement would be to write some kind of type hint into the Iceberg metadata that would tell pyiceberg whether the string column was supposed to be interpreted as a pyarrow
Yea, there's a separation of Iceberg type and the Arrow/Parquet/on-disk type. Iceberg has one string type; Arrow has two. The problem here is that Perhaps, instead of setting cc @sungwy / @Fokko |
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
Apache Iceberg version
0.7.0
Please describe the bug 🐞
There is a regression introduced in version 0.7.0 where Arrow tables written with a "string" data type get cast to "large_string" when read back from Iceberg.
The code below reproduces the bug. The assertion succeeds in v0.6.1, but fails in 0.7.0 because the schema is being changed from "string" to "large_string".