[Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim #17077

Closed
Tracked by #20089
asfimport opened this issue Mar 9, 2020 · 9 comments

Comments

asfimport commented Mar 9, 2020

Assemble a minimal ParquetDataset shim backed by pyarrow.dataset.*. Replace the existing ParquetDataset with the shim by default, and allow an opt-out for users who need the current ParquetDataset.

This is mostly exploratory, to see which of the Python tests fail.
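
To make the scope concrete, here is a minimal sketch (an illustration only, not the actual implementation) of what such a shim could look like, delegating discovery and reading to pyarrow.dataset. The class name, the filter_expression parameter, and the hive-partitioning assumption are hypothetical.

```python
import pyarrow.dataset as ds


class MinimalParquetDatasetShim:
    """Hypothetical shim: keeps the ParquetDataset(...).read() shape,
    but is backed entirely by pyarrow.dataset."""

    def __init__(self, path_or_paths, filesystem=None, filter_expression=None):
        # Discovery is delegated to the datasets API; "hive" partitioning
        # is assumed here purely for illustration.
        self._dataset = ds.dataset(
            path_or_paths,
            filesystem=filesystem,
            format="parquet",
            partitioning="hive",
        )
        self._filter = filter_expression  # a pyarrow.dataset expression, or None

    @property
    def schema(self):
        return self._dataset.schema

    def read(self, columns=None, use_threads=True):
        # Column projection and row filtering (including on non-partition
        # columns) are handled by Dataset.to_table.
        return self._dataset.to_table(
            columns=columns, filter=self._filter, use_threads=use_threads
        )
```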

Reporter: Ben Kietzman / @bkietz
Assignee: Joris Van den Bossche / @jorisvandenbossche

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-8039. Please see the migration documentation for further details.

Neal Richardson / @nealrichardson:
We might focus this by saying that the objective is to satisfy the .read() method of ParquetDataset and to at least support the filters argument to the init method (with the bonus feature that you can filter on any column, not just partition keys, as an incentive to use the new code). This would exclude supporting object attributes like "pieces", which we could address separately for dask et al.

See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html and https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.
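
A usage sketch of what that would look like from the caller's side (the path and column names are made up); the point of the bonus feature is that the filter below targets an ordinary column rather than a partition key:

```python
import pyarrow.parquet as pq

# Construct the dataset with a filter on a regular (non-partition) column,
# then read only a subset of columns. Path and names are illustrative.
dataset = pq.ParquetDataset(
    "data/events/",
    filters=[("value", ">", 100)],
)
table = dataset.read(columns=["value", "category"])
```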

Joris Van den Bossche / @jorisvandenbossche:

> We might focus this by saying that the objective is to satisfy the .read() method of ParquetDataset and to at least support the filters argument to the init method (with the bonus feature that you can filter on any column, not just partition keys, as an incentive to use the new code).

If that is the goal, I think this should be trivial. Which isn't to say it's not useful! Being able to run part of the tests with it might surface issues. I did something similar for the read_table function at #6303 (the utility code there to convert old-format filters to the new expressions might be useful here as well). In case this issue is not yet started, I could also add this to that PR tomorrow.
This would also stress test the manifest / dataset discovery part (which has a custom Python implementation, so it would be useful to compare it to what the datasets API does), but I'm not sure the tests for this are very extensive.

> This would exclude supporting object attributes like "pieces", which we could address separately for dask et al.

Yes, but these are the hard parts (and the parts that dask uses extensively). So it's mostly for those parts that we will need to decide whether we want to try to create an API-compatible shim, or rather provide the necessary features to enable migration to the new API.
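
For reference, a rough sketch of the kind of filter conversion mentioned above (old-style DNF filters to a pyarrow.dataset expression). This is an illustration, not the actual utility from #6303:

```python
import operator
import pyarrow.dataset as ds

_OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}


def filters_to_expression(filters):
    # Accept both the single-conjunction form [("a", "=", 1), ...] and the
    # DNF form [[...], [...]] (a list of OR-ed conjunctions).
    if filters and isinstance(filters[0], tuple):
        filters = [filters]

    disjunction = None
    for conjunction_spec in filters:
        conjunction = None
        for col, op, value in conjunction_spec:
            if op == "in":
                expr = ds.field(col).isin(value)
            else:
                expr = _OPS[op](ds.field(col), value)
            conjunction = expr if conjunction is None else conjunction & expr
        disjunction = (
            conjunction if disjunction is None else disjunction | conjunction
        )
    return disjunction


# Example: AND of two predicates, one on a partition-like key, one on a value.
expr = filters_to_expression([("year", "=", 2020), ("value", ">", 0)])
```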

Neal Richardson / @nealrichardson:
Right, my thought was that we'd solve the Dask problem a different way, like you already started exploring.

Joris Van den Bossche / @jorisvandenbossche:
So, a more specific comment here: if basically the only thing such a ParquetDataset would support is a read() method, I am not sure what the benefit of such a ParquetDataset class would be compared to the parquet.read_table function (which also supports reading with column selection / row filtering), when designing a new API.

And I already added support for this in read_table in #6303.
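
A usage sketch of that one-step read (the directory layout, partition key, and column names are assumptions for illustration):

```python
import pyarrow.parquet as pq

# Single call: dataset discovery, column projection, and row filtering.
# "year" is assumed to be a partition key; "value" is a regular column.
table = pq.read_table(
    "data/events/",
    columns=["value", "category"],
    filters=[("year", "=", 2020), ("value", ">", 0)],
)
```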

Neal Richardson / @nealrichardson:
Ah, good call. That sounds reasonable to me (as someone who is not a user). And it looks like it is trivial enough to promote only read_table and not mention ParquetDataset in https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.

So the idea would be that read_table would be the function that gets the new Dataset option, and ParquetDataset would be unchanged (just no longer encouraged for use).

@wesm thoughts?

Joris Van den Bossche / @jorisvandenbossche:

> So the idea would be that read_table would be the function that gets the new Dataset option, and ParquetDataset would be unchanged (just no longer encouraged for use).

That would be an option, yes.

To give some context from dask's usage: they actually do not use the ParquetDataset.read() method. They use many other parts of the class (the partitioning information, the pieces, the metadata, etc.), but they do not read the full dataset through it. For reading, they use ParquetDatasetPiece.read().

Now, dask's usage is maybe not typical, so it would be good to check some other places for how ParquetDataset gets used.

For example on StackOverflow:

  • The top answer on reading a partitioned dataset from S3 uses ParquetDataset(...).read().to_pandas(): https://stackoverflow.com/questions/45043554/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow/48809552#48809552

  • Some other, less popular S3-related questions also mention ParquetDataset with basically the same usage pattern.

Now, there might still be value in having a two-step way (creating the dataset, then reading) instead of a one-step read_table, since the former allows some inspection of the dataset before reading it.
But this is what pyarrow.dataset.Dataset already provides. So the question is whether a ParquetDataset is then needed at all.

I suppose such a subclass might be useful to directly expose the Parquet-specific things (e.g. without needing to specify format="parquet", or by exposing ParquetFileFormatOptions directly in the constructor of ParquetDataset, etc.). I think something like this is useful, but then I would rather model it after dataset.Dataset, to make it consistent with that new API, rather than after parquet.ParquetDataset (which would introduce inconsistencies with the new API), maybe just with a read() method for basic backwards compatibility (but for the rest following the API of dataset.Dataset).
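
For comparison, a sketch of that two-step pattern expressed directly against pyarrow.dataset (paths, partitioning, and column names are illustrative):

```python
import pyarrow.dataset as ds

# Step 1: discover the dataset and inspect it before reading anything.
dataset = ds.dataset("data/events/", format="parquet", partitioning="hive")
print(dataset.schema)
print(dataset.files)  # the discovered data files

# Step 2: read with column projection and a filter on any column.
table = dataset.to_table(
    columns=["value", "category"],
    filter=ds.field("value") > 100,
)
```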

Wes McKinney / @wesm:
If it were possible to maintain backward compatibility for simple use cases (with a FutureWarning) for a period of time, that would seem ideal, as long as we're on a path to address the use cases with the new code. It seems fine if ParquetDataset eventually goes away entirely.
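
A minimal sketch of that deprecation pattern (the helper name and the message wording are illustrative, not the actual pyarrow code):

```python
import warnings


def _warn_legacy_parquet_dataset():
    # Emitted from the legacy code path so simple use cases keep working
    # while signalling that the behavior will change in a future release.
    warnings.warn(
        "The legacy ParquetDataset implementation is deprecated and will be "
        "replaced by the pyarrow.dataset-backed implementation.",
        FutureWarning,
        stacklevel=2,
    )
```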

Joris Van den Bossche / @jorisvandenbossche:
I expanded my existing PR for read_table (#6303) with a small ParquetDataset shim that should at least make the basic ParquetDataset(...).read() work.

Right now I added a use_dataset=False/True keyword (defaulting to False), so you can opt in to using the new dataset API under the hood (and so I can use it in the tests). But the final end-user API we want to provide for this should still be discussed.
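
A usage sketch of the opt-in keyword as described in this comment (the keyword name and default reflect the state of the PR at the time, not necessarily the final API; the path is illustrative):

```python
import pyarrow.parquet as pq

# Default: the existing pure-Python ParquetDataset machinery.
table_old = pq.read_table("data/events/")

# Opt in to the new pyarrow.dataset-backed code path.
table_new = pq.read_table("data/events/", use_dataset=True)
```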

Ben Kietzman / @bkietz:
Issue resolved by pull request #6303
