Upcasting and Downcasting inconsistencies with PyArrow Schema #791
Comments
Working on this issue, I noticed that Parquet has a restriction where a single value larger than 2GB cannot be stored. At the very least, Arrow has a check that prevents this: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc#L169

If Parquet cannot hold data that is larger than 2GB, is there a benefit in supporting `large_binary`?
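For context, here is a minimal sketch (not from the thread; it assumes pyarrow is installed, and the file name and the roughly 2 GiB allocation are only for illustration) of how that check would trip:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A single value of exactly 2 GiB (this allocates ~2 GiB of memory).
big_value = b"x" * (2**31)

# The value fits in memory in a large_binary array...
table = pa.table({"blob": pa.array([big_value], type=pa.large_binary())})

# ...but the Parquet encoder is expected to reject it, since a single
# BYTE_ARRAY value may not be 2 GiB or larger (the check linked above).
pq.write_table(table, "big_blob.parquet")
```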
I'm seeing the same restriction when using Polars' `write_parquet`, so it looks like a Parquet limitation rather than an Arrow restriction.
This is interesting. Why would Polars go with `large_binary`?
For Arrow, the large types use 64-bit offsets, which is why Polars defaults to them. See: pola-rs/polars#7422

Still, I think the inconsistency is not good.
My apologies, I think I might not have done a good job of explaining the problem, @Fokko. I think the issue is with Parquet, not Arrow or Polars; I'm using those two libraries as examples to show that a record exceeding 2GB cannot be written into a Parquet file, even if it can be represented in memory as a large Arrow data type. This issue raised on Polars seems to reiterate that point as well: pola-rs/polars#10774

This is just based on my research this week, so it is definitely possible that I'm missing something here, but so far I haven't been able to write an actually large record (>2GB) into Parquet.
I agree that you cannot write a single field of 2GB+ to a Parquet file; in that case, Parquet is probably not the best way of storing such a big blob. The reason for the large types is how Arrow lays out variable-length values: a single contiguous data buffer plus an offsets buffer, for example:

```python
data = 'foobararrow'
offsets = [0, 3, 6, 11]
```

If the offsets are 32 bits, then you need to chunk the data into smaller buffers, which negatively impacts performance.
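As a concrete illustration of that layout (a small pyarrow sketch, not from the thread), the only difference between `binary` and `large_binary` is the width of the offsets:

```python
import pyarrow as pa

values = [b"foo", b"bar", b"arrow"]

# Both arrays store the values as one contiguous data buffer
# (b"foobararrow") plus an offsets buffer ([0, 3, 6, 11]).
small = pa.array(values, type=pa.binary())        # 32-bit (int32) offsets
large = pa.array(values, type=pa.large_binary())  # 64-bit (int64) offsets

print(small.type)  # binary
print(large.type)  # large_binary

# With int32 offsets a single array cannot address more than 2 GiB of data,
# so larger data has to be split into multiple chunks; int64 offsets avoid that.
```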
Gotcha, thank you for the explanation @Fokko. I didn't think of how using a `large_binary` could actually improve performance because the data is grouped into large buffers. I think I was conflating the 2GB limit of Parquet with the need for a large type. Put simply, I was asking: if we can't write data that large into Parquet, why bother using a type that is specifically designed to support larger data (which can't be written into the file anyway)? But now I see that the motivation for supporting large types is different from the motivation for writing larger data, and we might also introduce a different file format as the storage medium that could support writing larger types in the future.
Apache Iceberg version
0.6.0 (latest release)
Please describe the bug 🐞
`schema_to_pyarrow` converts `BinaryType` to the `pa.large_binary()` type. This creates an inconsistency between the schema produced by `schema_to_pyarrow` and the arrow table schema produced from the data scan. Related PR: #409
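A minimal sketch of the conversion (assuming pyiceberg 0.6.0, where `schema_to_pyarrow` lives in `pyiceberg.io.pyarrow`; the field name and id are made up):

```python
import pyarrow as pa
from pyiceberg.schema import Schema
from pyiceberg.types import BinaryType, NestedField
from pyiceberg.io.pyarrow import schema_to_pyarrow

iceberg_schema = Schema(
    NestedField(field_id=1, name="b", field_type=BinaryType(), required=False),
)

arrow_schema = schema_to_pyarrow(iceberg_schema)
print(arrow_schema.field("b").type)                  # large_binary
print(arrow_schema.field("b").type == pa.binary())   # False
```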
The implication of this bug is that a `pa.Table` read from the same Iceberg table may yield a different schema depending on whether or not there is data within the defined table scan. More importantly, it also means that if one of the files is empty and another file has data within the same table scan, the schema inconsistency between the two arrow tables will result in an error when we attempt to `pa.concat_tables(tables)`.
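A small pyarrow-only sketch (hypothetical column name) of that failure mode, where one scan task produces an empty table typed from the converted schema and another produces a table typed from the data:

```python
import pyarrow as pa

# Empty table typed via the converted schema (large_binary)...
empty = pa.table({"b": pa.array([], type=pa.large_binary())})

# ...and a non-empty table whose column came back as plain binary.
with_data = pa.table({"b": pa.array([b"\x01\x02"], type=pa.binary())})

# The schemas differ (large_binary vs binary), so this raises pyarrow.lib.ArrowInvalid.
pa.concat_tables([empty, with_data])
```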