ENH: use native filesystem (if available) for read_parquet with pyarrow engine #41194
@@ -172,9 +172,25 @@ def write(

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

filesystem = kwargs.pop("filesystem", None)
is there a reason we are not actually exposing filesystem as a named keyword?
> is there a reason we are not actually exposing filesystem as a named keyword?

That could also be an option. I think the problem is that right now, this is a keyword that is just passed through to the underlying engine (as we do for other engine-specific keywords as well), and filesystem is not actually a keyword supported by fastparquet.
Actually, to be more correct, fastparquet supports it but names it fs. So we could indeed add a keyword filesystem, and map it to fs for fastparquet?
yes this sounds reasonable
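The mapping discussed in this thread could be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper name `map_filesystem_kwarg` is hypothetical, and the `fs` spelling for fastparquet is taken from the comment above rather than checked against a specific fastparquet version.

```python
def map_filesystem_kwarg(engine, filesystem=None, **kwargs):
    """Hypothetical helper: translate a named ``filesystem`` keyword
    into the spelling each parquet engine expects.

    Assumption (from the review thread): pyarrow calls the keyword
    ``filesystem`` while fastparquet calls it ``fs``.
    """
    # Copy so the caller's keyword dict is never mutated.
    engine_kwargs = dict(kwargs)

    # Nothing to translate if the user did not pass a filesystem.
    if filesystem is None:
        return engine_kwargs

    if engine == "pyarrow":
        engine_kwargs["filesystem"] = filesystem
    elif engine == "fastparquet":
        engine_kwargs["fs"] = filesystem
    else:
        raise ValueError(f"unknown parquet engine: {engine!r}")
    return engine_kwargs
```

With a named keyword like this, callers would write `filesystem=...` regardless of engine, and only the engine wrapper would need to know about the differing spelling.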
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
@jorisvandenbossche - was there much else that needed doing on this? Happy to help as this is functionality I use a lot :)
Might want to document this behavior. If so, here's my docs from the geopandas PR:
this is quite old, happy to reopen if actively worked on.
@jorisvandenbossche if you can rebase and move the note to 1.5
closing as stale
Some timings in https://issues.apache.org/jira/browse/ARROW-12428 suggest that it can be beneficial for reading parquet from S3 to use the pyarrow filesystem implementation, instead of pandas always using fsspec if installed (especially when selecting columns).