Update pyarrow dependency from 1.0.1 to 3.0 #48014

Closed
timlod opened this issue Aug 9, 2022 · 2 comments · Fixed by #49096
Labels
Arrow pyarrow functionality Dependencies Required and optional dependencies

Comments

@timlod
Contributor

timlod commented Aug 9, 2022

Pandas 1.4 currently requires pyarrow 1.0.1 (released August 2020).
This issue is about discussing an update to the required pyarrow version, as suggested in #47781.

#47781 implements a performance improvement that would require pyarrow 3.0 (released January 2021).

Pyarrow releases now move pretty fast, with a new release roughly every 3 months, each adding major functionality. As such, efforts such as Arrow-backed storage would probably also benefit from regularly updating the pyarrow dependency in pandas (as has been done in previous versions).

@phofl phofl added the Dependencies Required and optional dependencies label Aug 10, 2022
@mroeschke mroeschke added the Arrow pyarrow functionality label Aug 11, 2022
@mroeschke
Member

I would be in favor of upgrading even to 4.0 (released April 26, 2021). We have had some CI issues related to pyarrow CSV reading with versions 2 and 3.

cc @jorisvandenbossche

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 12, 2022

Note that we already required a newer pyarrow version specifically for the StringDtype functionality in the past. You can see:

    def _chk_pyarrow_available() -> None:
        if pa_version_under1p01:
            msg = "pyarrow>=1.0.0 is required for PyArrow backed ArrowExtensionArray."
            raise ImportError(msg)

(that was from a time when we still supported older pyarrow versions. Given that we now require 1.0.1 globally, this check is a bit obsolete)

Just to say that we could easily bump the required pyarrow version for the StringDtype, while still allowing pyarrow 1.0 for the Parquet IO.
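To illustrate the idea of a per-feature minimum, here is a minimal sketch and not the actual pandas implementation. The names `FEATURE_MINIMUMS`, `parse_version` and `check_pyarrow` are hypothetical, and the 3.0 minimum for the string dtype is only an assumed value for the example. The installed version is passed in as a string so the logic can be tried without pyarrow installed.

```python
# Hypothetical per-feature minimum pyarrow versions (illustrative values only).
FEATURE_MINIMUMS = {
    "parquet": (1, 0, 1),       # the current global minimum mentioned in this thread
    "string_dtype": (3, 0, 0),  # assumed stricter minimum, for illustration
}


def parse_version(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a 3-tuple of ints, padding with zeros.

    Pre-release suffixes such as 'rc1' are not handled in this sketch.
    """
    parts = [int(p) for p in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_pyarrow(feature: str, installed: str) -> None:
    """Raise ImportError if `installed` is older than the minimum for `feature`."""
    minimum = FEATURE_MINIMUMS[feature]
    if parse_version(installed) < minimum:
        needed = ".".join(str(p) for p in minimum)
        raise ImportError(
            f"pyarrow>={needed} is required for {feature} (found {installed})."
        )


# pyarrow 1.0.1 is enough for Parquet IO in this sketch ...
check_pyarrow("parquet", "1.0.1")

# ... but not for the string dtype, which has the stricter minimum.
try:
    check_pyarrow("string_dtype", "2.0.0")
except ImportError as err:
    print(err)
```

With this shape, bumping the requirement for one feature only means changing a single entry, while the other features keep accepting older pyarrow versions.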


Now, I am certainly not against increasing our minimum version. But as a data point, we noticed a month ago that, based on PyPI download data, pyarrow 2.0 is still widely used (there is probably some often-used package that has that pin). In general we notice that quite a few downstream packages of pyarrow lag behind in supporting the latest pyarrow versions.
Of course, you can also say that people pinning to an older pyarrow can also use an older pandas.

For example, for numpy the rule is to support all versions released in the 24 months prior to the project release (https://numpy.org/neps/nep-0029-deprecation_policy.html). If we took the same rule for pyarrow, that would mean we can drop pyarrow 1.0 now but would still support pyarrow 2.0 (and we could start requiring pyarrow 3.0 in the release after 1.5 / starting from October).
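As a rough sketch of how such a 24-month window plays out, the snippet below applies the rule to the release months mentioned in this thread. The helper names are hypothetical, the day of the month is a placeholder where the thread only gives a month, and the October 2020 date for pyarrow 2.0 is inferred from the "starting from October" remark rather than stated explicitly.

```python
from datetime import date

# Release months as mentioned in this thread. Only the 4.0 date is given
# exactly; for the others the day is a placeholder (assumption).
RELEASES = {
    "1.0.1": date(2020, 8, 1),
    "2.0": date(2020, 10, 1),  # month inferred, not stated in the thread
    "3.0": date(2021, 1, 1),
    "4.0": date(2021, 4, 26),
}


def window_start(as_of: date, years: int = 2) -> date:
    """Return the date `years` years before `as_of` (24 months by default)."""
    try:
        return as_of.replace(year=as_of.year - years)
    except ValueError:
        # as_of is Feb 29 and the target year is not a leap year
        return as_of.replace(year=as_of.year - years, day=28)


def supported(as_of: date) -> list:
    """Versions released within the window ending at `as_of`, oldest first."""
    cutoff = window_start(as_of)
    return [v for v, released in RELEASES.items() if released >= cutoff]


# At the time of this comment (Aug 12, 2022), 1.0.1 falls outside the window.
print(supported(date(2022, 8, 12)))  # ['2.0', '3.0', '4.0']

# For a hypothetical project release on Nov 1, 2022, 2.0 drops out as well.
print(supported(date(2022, 11, 1)))  # ['3.0', '4.0']
```

The first entry of the returned list is the minimum version the rule would allow for a release on that date.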

@mroeschke mroeschke changed the title Update pyarrow dependency Update pyarrow dependency from 1.0.1 to 3.0 Aug 26, 2022
4 participants