[Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim #17077
Comments
Neal Richardson / @nealrichardson: See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html and https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.
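For reference, a minimal sketch of the legacy API those docs describe; the path and filter values are hypothetical:

```python
import pyarrow.parquet as pq

# Open a (possibly partitioned) directory of Parquet files with the
# legacy ParquetDataset API; "data/" and the filter are made up.
dataset = pq.ParquetDataset("data/", filters=[("year", "=", 2020)])
table = dataset.read()
```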
Joris Van den Bossche / @jorisvandenbossche:
If that is the goal, I think this should be trivial. Which isn't to say it's not useful! Being able to run part of the tests with the shim might discover issues. I did something similar for the read_table function at #6303 (the utility code to convert old-format filters to the new expressions might be useful here as well). In case this issue is not yet started, I could also add this to that PR tomorrow.
Yes, but these are the hard parts (and the parts that dask uses extensively). So it's mostly for those parts that we will need to decide whether we want to try to create an API-compatible shim, or rather try to provide the necessary features to be able to migrate to the new API.
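As an illustration of that filter conversion, a hedged sketch of how an old-format filter list maps onto a dataset expression; the column names and values are hypothetical:

```python
import pyarrow.dataset as ds

# Old-format filters: a list of (column, op, value) tuples.
legacy_filters = [("year", "=", 2020), ("month", "in", [1, 2])]

# The equivalent expression in the new API, built by hand here; the
# utility referenced above would automate this translation.
expr = (ds.field("year") == 2020) & ds.field("month").isin([1, 2])
```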
Neal Richardson / @nealrichardson:
Joris Van den Bossche / @jorisvandenbossche: And supporting this in
Neal Richardson / @nealrichardson: So the idea would be that read_table would be the function that gets the new Dataset option, and ParquetDataset would be unchanged (just no longer encouraged for use). @wesm thoughts?
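From the user's side, that proposal might look something like the sketch below; the use_legacy_dataset flag is an assumed name for the opt-in switch (it is the name pyarrow later adopted), and the path and columns are hypothetical:

```python
import pyarrow.parquet as pq

# Hypothetical sketch: read_table grows a switch that routes reads
# through pyarrow.dataset instead of the legacy ParquetDataset.
table = pq.read_table(
    "data/",                  # hypothetical path
    columns=["x", "y"],
    filters=[("year", "=", 2020)],
    use_legacy_dataset=False,  # assumed name for the new option
)
```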
Joris Van den Bossche / @jorisvandenbossche:
That would be an option, yes. To give some context from dask's usage: they actually do not use the ParquetDataset.read() method. They use a lot of other parts of the class: getting the partitioning information, the pieces, the metadata, etc., but not reading the full dataset. For reading, they use ParquetDatasetPiece.read(). Now, dask's usage is maybe not typical, so it would be good to check some other places for how ParquetDataset gets used. For example on StackOverflow:
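To make the contrast concrete, a rough sketch of the pieces-based pattern described above next to its approximate pyarrow.dataset analogue; the path is hypothetical:

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Legacy pattern: iterate over per-file pieces rather than calling
# ParquetDataset.read() on the whole dataset.
legacy = pq.ParquetDataset("data/")
for piece in legacy.pieces:
    table = piece.read()  # ParquetDatasetPiece.read()

# Rough new-API analogue: fragments play the role of pieces.
new = ds.dataset("data/", format="parquet")
for fragment in new.get_fragments():
    table = fragment.to_table()
```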
Wes McKinney / @wesm:
Joris Van den Bossche / @jorisvandenbossche: Right now I added a
Ben Kietzman / @bkietz:
Assemble a minimal ParquetDataset shim backed by pyarrow.dataset.*. Replace the existing ParquetDataset with the shim by default, and allow opt-out for users who need the current ParquetDataset. This is mostly exploratory, to see which of the Python tests fail.
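A minimal sketch of what such a shim could look like; the class body, parameter set, and the Expression-only filter argument are assumptions for illustration, not the actual implementation:

```python
import pyarrow.dataset as ds


class ParquetDataset:
    """Hypothetical shim: mimics a slice of the legacy ParquetDataset
    surface while delegating all work to pyarrow.dataset."""

    def __init__(self, path_or_paths, filesystem=None, partitioning="hive"):
        self._dataset = ds.dataset(
            path_or_paths,
            format="parquet",
            filesystem=filesystem,
            partitioning=partitioning,
        )

    @property
    def schema(self):
        return self._dataset.schema

    def read(self, columns=None, use_threads=True, filter=None):
        # `filter` takes a ds.Expression here; a complete shim would
        # also translate the legacy list-of-tuples filter format.
        return self._dataset.to_table(
            columns=columns, filter=filter, use_threads=use_threads
        )
```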
Reporter: Ben Kietzman / @bkietz
Assignee: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-8039. Please see the migration documentation for further details.