Skip to content

Support 32-bit Utf8/Binary/List types #7422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stinodego opened this issue Mar 8, 2023 · 6 comments
Closed

Support 32-bit Utf8/Binary/List types #7422

stinodego opened this issue Mar 8, 2023 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@stinodego
Copy link
Contributor

stinodego commented Mar 8, 2023

Problem description

Polars Utf8/Binary/List datatypes are currently only available in their 64-bit variant. In Arrow these are known as large_string/large_binary/large_list.

Arrow also has a 32-bit version for these: string/binary/list_.

The 32-bit versions are probably sufficient for many use cases, and will be more efficient. Supporting these will also allow zero-copy conversions from these types into Polars.

@stinodego stinodego added the enhancement New feature or an improvement of an existing feature label Mar 8, 2023
@ghuls
Copy link
Collaborator

ghuls commented Mar 8, 2023

I don't think their are that much more efficient, probably only the pointers and length of the array are 32-bit instead of 64-bit, but for the data storage part itself there is no change.

Rechunking to one chunk might not be possible with the 32-bit variants, but maybe in combination with the streaming API, this is less of a problem.

@ritchie46
Copy link
Member

IMO the small versions are rather worthless for polars' goals. They keep offsets with i32 values so they only can keep 2GB of string data in a single column. This limit was hit on the first week someone used polars. That means we must do all kinds of complicated chunking because our columns might overflow. Other than that, take operations on small strings will often be working on chunkedarrays, meaning they do a double indirection, killing performance.

We did support them in the past, but because of problems mentioned above we switched to the large variants.

@stinodego
Copy link
Contributor Author

All right, good to have some context with that decision!

I ran into this because Delta does not support large types.

I'll close this as "not planned" for now.

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Mar 8, 2023
@ritchie46
Copy link
Member

We can add an option to convert to small types when calling to_arrow.

@stinodego
Copy link
Contributor Author

That would be helpful - I was actually implementing some casting functionality in pyarrow, but I guess it's faster to do it as part of the original conversion in Rust.

I also need to cast all unsigned types to signed types, and all Datetime types to use microseconds. But I can do that in Polars.

Overall, write_delta is a bit more tricky than I initially thought 😄

@stinodego
Copy link
Contributor Author

I guess this is related: #7431

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

3 participants