Update pyarrow dependency from 1.0.1 to 3.0 #48014
Comments
I would be in favor of upgrading even to 4.0 (released April 26, 2021). We have had some CI issues related to pyarrow csv reading for versions 2 & 3.
Note that we already required a newer pyarrow version specifically for the StringDtype functionality in the past. You can see pandas/pandas/core/arrays/string_arrow.py, lines 56 to 59 in c8fc47b.
(That was from a time when we still supported older pyarrow versions. Given that we now require 1.0.1 globally, this check is a bit obsolete.)

Just to say that we could easily bump the required pyarrow version for the StringDtype, while still allowing pyarrow 1.0 for the Parquet IO.

Now, I am certainly not against increasing our minimum version. But as a data point, we noticed a month ago that, based on PyPI download data, pyarrow 2.0 is still widely used (there is probably some often-used package that has that pin). In general, we notice that quite a few downstream packages of pyarrow lag behind in supporting the latest pyarrow versions.

For example, for numpy the rule is to support all versions released in the 24 months prior to the project release (https://numpy.org/neps/nep-0029-deprecation_policy.html). If we took the same rule for pyarrow, that would mean we can drop pyarrow 1.0 now, but would still support pyarrow 2.0 (and we could start requiring pyarrow 3.0 in the release after 1.5 / starting from October).
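To make the 24-month rule concrete, here is a minimal sketch of how such a window could be computed. The release dates are approximations: the thread only gives the month for most versions, so the day of the month and the 2.0.0 date are assumptions, and the function names are made up for illustration.

```python
import calendar
from datetime import date

# Approximate pyarrow release dates. Only the 4.0.0 date is given exactly in
# this thread; the other days (and the 2.0.0 month) are assumptions.
PYARROW_RELEASES = {
    "1.0.1": date(2020, 8, 1),
    "2.0.0": date(2020, 10, 1),
    "3.0.0": date(2021, 1, 1),
    "4.0.0": date(2021, 4, 26),
}


def months_before(day, months):
    """Return the date `months` calendar months before `day`."""
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so that e.g. March 31 minus one month stays valid.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def supported_versions(project_release, releases, window_months=24):
    """List the versions released within the window before `project_release`."""
    cutoff = months_before(project_release, window_months)
    return [version for version, released in releases.items() if released >= cutoff]


# A hypothetical project release in September 2022 would still support 2.0.0,
# while one in November 2022 would start at 3.0.0.
print(supported_versions(date(2022, 9, 1), PYARROW_RELEASES))
print(supported_versions(date(2022, 11, 1), PYARROW_RELEASES))
```

With these assumed dates, the result matches the reasoning above: 1.0 falls out of the window now, and 2.0 falls out from roughly October onward.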
Pandas 1.4 currently requires pyarrow 1.0.1 (released August 2020).
This issue is about discussing an update to the required pyarrow version, as suggested in #47781.
#47781 implements a performance improvement that would require pyarrow 3.0 (released January 2021).
Pyarrow releases now move pretty fast, with new releases that add major functionality coming out approximately every 3 months. As such, efforts such as arrow-backed storage would probably also gain from regularly updating the pyarrow dependency in pandas (as has been done in previous versions).
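As an illustration of the alternative raised in the comments (a higher minimum for one feature while keeping a lower global minimum), a per-feature version check could look roughly like the sketch below. This is not the pandas implementation: the feature names, the minimum versions, and the function names are all hypothetical, and the parser only handles plain numeric versions such as "3.0" or "3.0.0".

```python
# Hypothetical per-feature minimums; the feature names and version numbers
# are illustrative assumptions, not values taken from pandas.
FEATURE_MIN_PYARROW = {
    "parquet_io": "1.0.1",
    "string_dtype": "3.0.0",
}


def parse_version(version):
    """Turn '3.0' or '3.0.0' into a comparable tuple such as (3, 0, 0)."""
    parts = [int(part) for part in version.split(".")[:3]]
    # Pad short versions so that '3.0' compares equal to '3.0.0'.
    return tuple(parts + [0] * (3 - len(parts)))


def check_pyarrow_for(feature, installed):
    """Raise ImportError if `installed` is too old for `feature`."""
    required = FEATURE_MIN_PYARROW[feature]
    if parse_version(installed) < parse_version(required):
        raise ImportError(
            f"pyarrow>={required} is required for {feature}; found {installed}."
        )


# pyarrow 1.0.1 passes the Parquet check but not the StringDtype check.
check_pyarrow_for("parquet_io", "1.0.1")
try:
    check_pyarrow_for("string_dtype", "1.0.1")
except ImportError as err:
    print(err)
```

In real use, the `installed` argument would come from the installed package, for example `pyarrow.__version__`.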