ENH: use native filesystem (if available) for read_parquet with pyarrow engine #41194

Conversation

jorisvandenbossche (Member)

Some timings in https://issues.apache.org/jira/browse/ARROW-12428 suggest that it can be beneficial for reading parquet from S3 to use the pyarrow filesystem implementation, instead of pandas always using fsspec if installed (especially when selecting columns).

@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Apr 28, 2021
@@ -172,9 +172,25 @@ def write(

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

filesystem = kwargs.pop("filesystem", None)
Contributor
is there a reason we are not actually exposing filesystem as a named keyword?

Member Author

> is there a reason we are not actually exposing filesystem as a named keyword?

That could also be an option. I think the problem is that right now, this is a keyword that is just passed through to the underlying engine (as we do for other engine-specific keywords as well), and filesystem is not actually a keyword supported by fastparquet.

Member Author

Actually, to be more correct: fastparquet does support it, but names it ``fs``. So we could indeed add a ``filesystem`` keyword and map it to ``fs`` for fastparquet?

Contributor

Yes, this sounds reasonable.

@github-actions (bot)

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 25, 2021
@alimcmaster1 (Member)

@jorisvandenbossche - was there much else that needed doing on this? Happy to help as this is functionality I use a lot :)

@TomAugspurger (Contributor)

Might want to document this behavior. If so, here are the docs from the geopandas PR:

        When no storage options are provided and a filesystem is implemented by
        both ``pyarrow.fs`` and ``fsspec`` (e.g. "s3://") then the ``pyarrow.fs``
        filesystem is preferred. Provide the instantiated fsspec filesystem using
        the ``filesystem`` keyword if you wish to use its implementation.

@jreback (Contributor) commented Nov 28, 2021

This is quite old; happy to reopen if it's actively worked on.

@jreback left a comment

@jorisvandenbossche if you can, rebase and move the note to 1.5

@mroeschke mroeschke added the Arrow pyarrow functionality label Dec 17, 2022
@simonjayhawkins (Member)

closing as stale
