[Python][Dataset] Support using dataset API in pyarrow.parquet with a minimal ParquetDataset shim #17077

Closed
Tracked by #20089
asfimport opened this issue Mar 9, 2020 · 9 comments

Comments

asfimport commented Mar 9, 2020

Assemble a minimal ParquetDataset shim backed by pyarrow.dataset.*. Replace the existing ParquetDataset with the shim by default, and allow an opt-out for users who need the current ParquetDataset.

This is mostly exploratory, to see which of the Python tests fail.
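
To make the scope concrete, here is a minimal sketch (an illustration only, not the actual implementation) of what such a shim could look like, delegating discovery and reading to pyarrow.dataset. The class name, the filter_expression parameter, and the hive-partitioning assumption are hypothetical.

```python
import pyarrow.dataset as ds


class MinimalParquetDatasetShim:
    """Hypothetical shim: keeps the ParquetDataset(...).read() shape,
    but is backed entirely by pyarrow.dataset."""

    def __init__(self, path_or_paths, filesystem=None, filter_expression=None):
        # Discovery is delegated to the datasets API; "hive" partitioning
        # is assumed here purely for illustration.
        self._dataset = ds.dataset(
            path_or_paths,
            filesystem=filesystem,
            format="parquet",
            partitioning="hive",
        )
        self._filter = filter_expression  # a pyarrow.dataset expression, or None

    @property
    def schema(self):
        return self._dataset.schema

    def read(self, columns=None, use_threads=True):
        # Column projection and row filtering (including on non-partition
        # columns) are handled by Dataset.to_table.
        return self._dataset.to_table(
            columns=columns, filter=self._filter, use_threads=use_threads
        )
```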

Reporter: Ben Kietzman / @bkietz
Assignee: Joris Van den Bossche / @jorisvandenbossche

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-8039. Please see the migration documentation for further details.

Neal Richardson / @nealrichardson:
We might focus this by saying that the objective is to satisfy the .read() method of ParquetDataset and to at least support the filters argument to the init method (with the bonus feature that you can filter on any column, not just partition keys, as an incentive to use the new code). This would exclude supporting object attributes like "pieces", which we could address separately for dask et al.

See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html and https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.
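
A usage sketch of what that would look like from the caller's side (the path and column names are made up); the point of the bonus feature is that the filter below targets an ordinary column rather than a partition key:

```python
import pyarrow.parquet as pq

# Construct the dataset with a filter on a regular (non-partition) column,
# then read only a subset of columns. Path and names are illustrative.
dataset = pq.ParquetDataset(
    "data/events/",
    filters=[("value", ">", 100)],
)
table = dataset.read(columns=["value", "category"])
```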

Joris Van den Bossche / @jorisvandenbossche:

> We might focus this by saying that the objective is to satisfy the .read() method of ParquetDataset and to at least support the filters argument to the init method (with the bonus feature that you can filter on any column, not just partition keys, as an incentive to use the new code).

If that is the goal, I think this should be trivial. Which isn't to say it's not useful! Being able to run part of the tests with it might surface issues. I did something similar for the read_table function at #6303 (the utility code there to convert old-format filters to the new expressions might be useful here as well). In case this issue is not yet started, I could also add this to that PR tomorrow.
This would also stress test the manifest / dataset discovery part (which has a custom Python implementation, so it would be useful to compare it to what the datasets API does), but I'm not sure the tests for this are very extensive.

> This would exclude supporting object attributes like "pieces", which we could address separately for dask et al.

Yes, but these are the hard parts (and the parts that dask uses extensively). So it's mostly for those parts that we will need to decide whether we want to try to create an API-compatible shim, or rather provide the necessary features to enable migration to the new API.
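
For reference, a rough sketch of the kind of filter conversion mentioned above (old-style DNF filters to a pyarrow.dataset expression). This is an illustration, not the actual utility from #6303:

```python
import operator
import pyarrow.dataset as ds

_OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}


def filters_to_expression(filters):
    # Accept both the single-conjunction form [("a", "=", 1), ...] and the
    # DNF form [[...], [...]] (a list of OR-ed conjunctions).
    if filters and isinstance(filters[0], tuple):
        filters = [filters]

    disjunction = None
    for conjunction_spec in filters:
        conjunction = None
        for col, op, value in conjunction_spec:
            if op == "in":
                expr = ds.field(col).isin(value)
            else:
                expr = _OPS[op](ds.field(col), value)
            conjunction = expr if conjunction is None else conjunction & expr
        disjunction = (
            conjunction if disjunction is None else disjunction | conjunction
        )
    return disjunction


# Example: AND of two predicates, one on a partition-like key, one on a value.
expr = filters_to_expression([("year", "=", 2020), ("value", ">", 0)])
```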

Neal Richardson / @nealrichardson:
Right, my thought was that we'd solve the Dask problem a different way, like you already started exploring.

Joris Van den Bossche / @jorisvandenbossche:
So, a more specific comment here: if basically the only thing such a ParquetDataset would support is a read() method, I am not sure what the benefit of such a ParquetDataset class would be compared to the parquet.read_table function (which also supports reading with column selection / row filtering), when designing a new API.

And I already added support for this in read_table in #6303.
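
A usage sketch of that one-step read (the directory layout, partition key, and column names are assumptions for illustration):

```python
import pyarrow.parquet as pq

# Single call: dataset discovery, column projection, and row filtering.
# "year" is assumed to be a partition key; "value" is a regular column.
table = pq.read_table(
    "data/events/",
    columns=["value", "category"],
    filters=[("year", "=", 2020), ("value", ">", 0)],
)
```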

Neal Richardson / @nealrichardson:
Ah, good call. That sounds reasonable to me (as someone who is not a user). And it looks like it is trivial enough to promote only read_table and not mention ParquetDataset in https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files.

So the idea would be that read_table would be the function that gets the new Dataset option, and ParquetDataset would be unchanged (just no longer encouraged for use).

@wesm thoughts?

Joris Van den Bossche / @jorisvandenbossche:

> So the idea would be that read_table would be the function that gets the new Dataset option, and ParquetDataset would be unchanged (just no longer encouraged for use).

That would be an option, yes.

To give some context from dask's usage: they actually do not use the ParquetDataset.read() method. They use many other parts of the class (the partitioning information, the pieces, the metadata, etc.), but they do not read the full dataset through it. For reading, they use ParquetDatasetPiece.read().

Now, dask's usage is maybe not typical, so it would be good to check some other places for how ParquetDataset gets used.

For example on StackOverflow:

  • The top answer on reading a partitioned dataset from S3 uses ParquetDataset(...).read().to_pandas(): https://stackoverflow.com/questions/45043554/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow/48809552#48809552

  • Some other, less popular S3-related questions also mention ParquetDataset with basically the same usage pattern.

Now, there might still be value in having a two-step way (creating the dataset, then reading) instead of a one-step read_table, since the former allows some inspection of the dataset before reading it.
But this is what pyarrow.dataset.Dataset already provides. So the question is whether a ParquetDataset is then needed at all.

I suppose such a subclass might be useful to directly expose the Parquet-specific things (e.g. without needing to specify format="parquet", or by exposing ParquetFileFormatOptions directly in the constructor of ParquetDataset, etc.). I think something like this is useful, but then I would rather model it after dataset.Dataset, to make it consistent with that new API, rather than after parquet.ParquetDataset (which would introduce inconsistencies with the new API), maybe just with a read() method for basic backwards compatibility (but for the rest following the API of dataset.Dataset).
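
For comparison, a sketch of that two-step pattern expressed directly against pyarrow.dataset (paths, partitioning, and column names are illustrative):

```python
import pyarrow.dataset as ds

# Step 1: discover the dataset and inspect it before reading anything.
dataset = ds.dataset("data/events/", format="parquet", partitioning="hive")
print(dataset.schema)
print(dataset.files)  # the discovered data files

# Step 2: read with column projection and a filter on any column.
table = dataset.to_table(
    columns=["value", "category"],
    filter=ds.field("value") > 100,
)
```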

Wes McKinney / @wesm:
If it were possible to maintain backward compatibility for simple use cases (with a FutureWarning) for a period of time, that would seem ideal, as long as we're on a path to address the use cases with the new code. It seems fine if ParquetDataset eventually goes away entirely.
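
A minimal sketch of that deprecation pattern (the helper name and the message wording are illustrative, not the actual pyarrow code):

```python
import warnings


def _warn_legacy_parquet_dataset():
    # Emitted from the legacy code path so simple use cases keep working
    # while signalling that the behavior will change in a future release.
    warnings.warn(
        "The legacy ParquetDataset implementation is deprecated and will be "
        "replaced by the pyarrow.dataset-backed implementation.",
        FutureWarning,
        stacklevel=2,
    )
```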

Joris Van den Bossche / @jorisvandenbossche:
I expanded my existing PR for read_table (#6303) with a small ParquetDataset shim that should at least make the basic ParquetDataset(...).read() work.

Right now I added a use_dataset=False/True keyword (defaulting to False), so you can opt in to using the new dataset API under the hood (and so I can use it in the tests). But the final end-user API we want to provide for this should still be discussed.
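
A usage sketch of the opt-in keyword as described in this comment (the keyword name and default reflect the state of the PR at the time, not necessarily the final API; the path is illustrative):

```python
import pyarrow.parquet as pq

# Default: the existing pure-Python ParquetDataset machinery.
table_old = pq.read_table("data/events/")

# Opt in to the new pyarrow.dataset-backed code path.
table_new = pq.read_table("data/events/", use_dataset=True)
```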

Ben Kietzman / @bkietz:
Issue resolved by pull request #6303
