ENH: use native filesystem (if available) for read_parquet with pyarrow engine #41194

Conversation

jorisvandenbossche (Member)

Some timings in https://issues.apache.org/jira/browse/ARROW-12428 suggest that it can be beneficial for reading parquet from S3 to use the pyarrow filesystem implementation, instead of pandas always using fsspec if installed (especially when selecting columns).

@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Apr 28, 2021
@@ -172,9 +172,25 @@ def write(

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

filesystem = kwargs.pop("filesystem", None)
Contributor
is there a reason we are not actually exposing filesystem as a named keyword?

Member Author

> is there a reason we are not actually exposing filesystem as a named keyword?

That could also be an option. I think the problem is that right now, this is a keyword that is just passed through to the underlying engine (as we do for other engine-specific keywords as well), and filesystem is not actually a keyword supported by fastparquet.

Member Author

Actually, to be more correct: fastparquet does support it, but names it ``fs``. So we could indeed add a ``filesystem`` keyword and map it to ``fs`` for fastparquet?

Contributor

Yes, this sounds reasonable.

@github-actions (bot)

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 25, 2021
@alimcmaster1 (Member)

@jorisvandenbossche - was there much else that needed doing on this? Happy to help as this is functionality I use a lot :)

@TomAugspurger (Contributor)

Might want to document this behavior. If so, here are the docs from the geopandas PR:

        When no storage options are provided and a filesystem is implemented by
        both ``pyarrow.fs`` and ``fsspec`` (e.g. "s3://") then the ``pyarrow.fs``
        filesystem is preferred. Provide the instantiated fsspec filesystem using
        the ``filesystem`` keyword if you wish to use its implementation.

@jreback (Contributor) commented Nov 28, 2021

This is quite old; happy to reopen if it's actively worked on.

@jreback left a comment

@jorisvandenbossche if you can, rebase and move the note to 1.5

@mroeschke mroeschke added the Arrow pyarrow functionality label Dec 17, 2022
@simonjayhawkins (Member)

closing as stale
